Source of the materials: Biopython cookbook (adapted) Status: Draft

Cookbook – Cool things to do with it

Biopython now has two collections of “cookbook” examples – this chapter (which has been included in this tutorial for many years and has gradually grown), and http://biopython.org/wiki/Category:Cookbook which is a user contributed collection on our wiki.

We’re trying to encourage Biopython users to contribute their own examples to the wiki. In addition to helping the community, one direct benefit of sharing an example like this is that you could also get some feedback on the code from other Biopython users and developers - which could help you improve all your Python code.

In the long term, we may end up moving all of the examples in this chapter to the wiki, or elsewhere within the tutorial.

Working with sequence files

This section shows some more examples of sequence input/output, using the Bio.SeqIO module described in Chapter [chapter:Bio.SeqIO].

Filtering a sequence file

Often you’ll have a large file with many sequences in it (e.g. FASTA file or genes, or a FASTQ or SFF file of reads), a separate shorter list of the IDs for a subset of sequences of interest, and want to make a new sequence file for this subset.

Let’s say the list of IDs is in a simple text file, as the first word on each line. This could be a tabular file where the first column is the ID. Try something like this:

from Bio import SeqIO
input_file = "big_file.sff"
id_file = "short_list.txt"
output_file = "short_list.sff"
wanted = set(line.rstrip("\n").split(None,1)[0] for line in open(id_file))
print("Found %i unique identifiers in %s" % (len(wanted), id_file))
records = (r for r in SeqIO.parse(input_file, "sff") if r.id in wanted)
count = SeqIO.write(records, output_file, "sff")
print("Saved %i records from %s to %s" % (count, input_file, output_file))
if count < len(wanted):
    print("Warning %i IDs not found in %s" % (len(wanted)-count, input_file))

Note that we use a Python set rather than a list, this makes testing membership faster.

Producing randomised genomes

Let’s suppose you are looking at genome sequence, hunting for some sequence feature – maybe extreme local GC% bias, or possible restriction digest sites. Once you’ve got your Python code working on the real genome it may be sensible to try running the same search on randomised versions of the same genome for statistical analysis (after all, any “features” you’ve found could just be there just by chance).

For this discussion, we’ll use the GenBank file for the pPCP1 plasmid from Yersinia pestis biovar Microtus. The file is included with the Biopython unit tests under the GenBank folder, or you can get it from our website, `NC_005816.gb <http://biopython.org/SRC/biopython/Tests/GenBank/NC_005816.gb>`__. This file contains one and only one record, so we can read it in as a SeqRecord using the Bio.SeqIO.read() function:

In [2]:
from Bio import SeqIO
original_rec = SeqIO.read("data/NC_005816.gb", "genbank")

So, how can we generate a shuffled versions of the original sequence? I would use the built in Python random module for this, in particular the function random.shuffle – but this works on a Python list. Our sequence is a Seq object, so in order to shuffle it we need to turn it into a list:

In [3]:
import random
nuc_list = list(original_rec.seq)
random.shuffle(nuc_list) #acts in situ!

Now, in order to use Bio.SeqIO to output the shuffled sequence, we need to construct a new SeqRecord with a new Seq object using this shuffled list. In order to do this, we need to turn the list of nucleotides (single letter strings) into a long string – the standard Python way to do this is with the string object’s join method.

In [4]:
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
shuffled_rec = SeqRecord(Seq("".join(nuc_list), original_rec.seq.alphabet),
    id="Shuffled", description="Based on %s" % original_rec.id)

Let’s put all these pieces together to make a complete Python script which generates a single FASTA file containing 30 randomly shuffled versions of the original sequence.

This first version just uses a big for loop and writes out the records one by one (using the SeqRecord’s format method described in Section [sec:Bio.SeqIO-and-StringIO]):

import random
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

original_rec = SeqIO.read("NC_005816.gb","genbank")

handle = open("shuffled.fasta", "w")
for i in range(30):
    nuc_list = list(original_rec.seq)
    random.shuffle(nuc_list)
    shuffled_rec = SeqRecord(Seq("".join(nuc_list), original_rec.seq.alphabet), \
                             id="Shuffled%i" % (i+1), \
                             description="Based on %s" % original_rec.id)
    handle.write(shuffled_rec.format("fasta"))
handle.close()

Personally I prefer the following version using a function to shuffle the record and a generator expression instead of the for loop:

import random
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

def make_shuffle_record(record, new_id):
    nuc_list = list(record.seq)
    random.shuffle(nuc_list)
    return SeqRecord(Seq("".join(nuc_list), record.seq.alphabet), \
           id=new_id, description="Based on %s" % original_rec.id)

original_rec = SeqIO.read("NC_005816.gb","genbank")
shuffled_recs = (make_shuffle_record(original_rec, "Shuffled%i" % (i+1)) \
                 for i in range(30))
handle = open("shuffled.fasta", "w")
SeqIO.write(shuffled_recs, handle, "fasta")
handle.close()

Translating a FASTA file of CDS entries

Suppose you’ve got an input file of CDS entries for some organism, and you want to generate a new FASTA file containing their protein sequences. i.e. Take each nucleotide sequence from the original file, and translate it. Back in Section [sec:translation] we saw how to use the Seq object’s translate method, and the optional cds argument which enables correct translation of alternative start codons.

We can combine this with Bio.SeqIO as shown in the reverse complement example in Section [sec:SeqIO-reverse-complement]. The key point is that for each nucleotide SeqRecord, we need to create a protein SeqRecord - and take care of naming it.

You can write you own function to do this, choosing suitable protein identifiers for your sequences, and the appropriate genetic code. In this example we just use the default table and add a prefix to the identifier:

from Bio.SeqRecord import SeqRecord
def make_protein_record(nuc_record):
    """Returns a new SeqRecord with the translated sequence (default table)."""
    return SeqRecord(seq = nuc_record.seq.translate(cds=True), \
                     id = "trans_" + nuc_record.id, \
                     description = "translation of CDS, using default table")

We can then use this function to turn the input nucleotide records into protein records ready for output. An elegant way and memory efficient way to do this is with a generator expression:

from Bio import SeqIO
proteins = (make_protein_record(nuc_rec) for nuc_rec in \
            SeqIO.parse("coding_sequences.fasta", "fasta"))
SeqIO.write(proteins, "translations.fasta", "fasta")

This should work on any FASTA file of complete coding sequences. If you are working on partial coding sequences, you may prefer to use nuc_record.seq.translate(to_stop=True) in the example above, as this wouldn’t check for a valid start codon etc.

Making the sequences in a FASTA file upper case

Often you’ll get data from collaborators as FASTA files, and sometimes the sequences can be in a mixture of upper and lower case. In some cases this is deliberate (e.g. lower case for poor quality regions), but usually it is not important. You may want to edit the file to make everything consistent (e.g. all upper case), and you can do this easily using the upper() method of the SeqRecord object (added in Biopython 1.55):

from Bio import SeqIO
records = (rec.upper() for rec in SeqIO.parse("mixed.fas", "fasta"))
count = SeqIO.write(records, "upper.fas", "fasta")
print("Converted %i records to upper case" % count)

How does this work? The first line is just importing the Bio.SeqIO module. The second line is the interesting bit – this is a Python generator expression which gives an upper case version of each record parsed from the input file (mixed.fas). In the third line we give this generator expression to the Bio.SeqIO.write() function and it saves the new upper cases records to our output file (upper.fas).

The reason we use a generator expression (rather than a list or list comprehension) is this means only one record is kept in memory at a time. This can be really important if you are dealing with large files with millions of entries.

Sorting a sequence file

Suppose you wanted to sort a sequence file by length (e.g. a set of contigs from an assembly), and you are working with a file format like FASTA or FASTQ which Bio.SeqIO can read, write (and index).

If the file is small enough, you can load it all into memory at once as a list of SeqRecord objects, sort the list, and save it:

from Bio import SeqIO
records = list(SeqIO.parse("ls_orchid.fasta","fasta"))
records.sort(cmp=lambda x,y: cmp(len(x),len(y)))
SeqIO.write(records, "sorted_orchids.fasta", "fasta")

The only clever bit is specifying a comparison function for how to sort the records (here we sort them by length). If you wanted the longest records first, you could flip the comparison or use the reverse argument:

from Bio import SeqIO
records = list(SeqIO.parse("ls_orchid.fasta","fasta"))
records.sort(cmp=lambda x,y: cmp(len(y),len(x)))
SeqIO.write(records, "sorted_orchids.fasta", "fasta")

Now that’s pretty straight forward - but what happens if you have a very large file and you can’t load it all into memory like this? For example, you might have some next-generation sequencing reads to sort by length. This can be solved using the Bio.SeqIO.index() function.

from Bio import SeqIO
#Get the lengths and ids, and sort on length
len_and_ids = sorted((len(rec), rec.id) for rec in \
                     SeqIO.parse("ls_orchid.fasta","fasta"))
ids = reversed([id for (length, id) in len_and_ids])
del len_and_ids #free this memory
record_index = SeqIO.index("ls_orchid.fasta", "fasta")
records = (record_index[id] for id in ids)
SeqIO.write(records, "sorted.fasta", "fasta")

First we scan through the file once using Bio.SeqIO.parse(), recording the record identifiers and their lengths in a list of tuples. We then sort this list to get them in length order, and discard the lengths. Using this sorted list of identifiers Bio.SeqIO.index() allows us to retrieve the records one by one, and we pass them to Bio.SeqIO.write() for output.

These examples all use Bio.SeqIO to parse the records into SeqRecord objects which are output using Bio.SeqIO.write(). What if you want to sort a file format which Bio.SeqIO.write() doesn’t support, like the plain text SwissProt format? Here is an alternative solution using the get_raw() method added to Bio.SeqIO.index() in Biopython 1.54 (see Section [sec:seqio-index-getraw]).

from Bio import SeqIO
#Get the lengths and ids, and sort on length
len_and_ids = sorted((len(rec), rec.id) for rec in \
                     SeqIO.parse("ls_orchid.fasta","fasta"))
ids = reversed([id for (length, id) in len_and_ids])
del len_and_ids #free this memory
record_index = SeqIO.index("ls_orchid.fasta", "fasta")
handle = open("sorted.fasta", "w")
for id in ids:
    handle.write(record_index.get_raw(id))
handle.close()

As a bonus, because it doesn’t parse the data into SeqRecord objects a second time it should be faster.

Simple quality filtering for FASTQ files

The FASTQ file format was introduced at Sanger and is now widely used for holding nucleotide sequencing reads together with their quality scores. FASTQ files (and the related QUAL files) are an excellent example of per-letter-annotation, because for each nucleotide in the sequence there is an associated quality score. Any per-letter-annotation is held in a SeqRecord in the letter_annotations dictionary as a list, tuple or string (with the same number of elements as the sequence length).

One common task is taking a large set of sequencing reads and filtering them (or cropping them) based on their quality scores. The following example is very simplistic, but should illustrate the basics of working with quality data in a SeqRecord object. All we are going to do here is read in a file of FASTQ data, and filter it to pick out only those records whose PHRED quality scores are all above some threshold (here 20).

For this example we’ll use some real data downloaded from the ENA sequence read archive, ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR020/SRR020192/SRR020192.fastq.gz (2MB) which unzips to a 19MB file SRR020192.fastq. This is some Roche 454 GS FLX single end data from virus infected California sea lions (see http://www.ebi.ac.uk/ena/data/view/SRS004476 for details).

First, let’s count the reads:

from Bio import SeqIO
count = 0
for rec in SeqIO.parse("SRR020192.fastq", "fastq"):
    count += 1
print("%i reads" % count)

Now let’s do a simple filtering for a minimum PHRED quality of 20:

from Bio import SeqIO
good_reads = (rec for rec in \
              SeqIO.parse("SRR020192.fastq", "fastq") \
              if min(rec.letter_annotations["phred_quality"]) >= 20)
count = SeqIO.write(good_reads, "good_quality.fastq", "fastq")
print("Saved %i reads" % count)

This pulled out only \(14580\) reads out of the \(41892\) present. A more sensible thing to do would be to quality trim the reads, but this is intended as an example only.

FASTQ files can contain millions of entries, so it is best to avoid loading them all into memory at once. This example uses a generator expression, which means only one SeqRecord is created at a time - avoiding any memory limitations.

Trimming off primer sequences

For this example we’re going to pretend that GATGACGGTGT is a 5’ primer sequence we want to look for in some FASTQ formatted read data. As in the example above, we’ll use the SRR020192.fastq file downloaded from the ENA (ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR020/SRR020192/SRR020192.fastq.gz). The same approach would work with any other supported file format (e.g. FASTA files).

This code uses Bio.SeqIO with a generator expression (to avoid loading all the sequences into memory at once), and the Seq object’s startswith method to see if the read starts with the primer sequence:

from Bio import SeqIO
primer_reads = (rec for rec in \
                SeqIO.parse("SRR020192.fastq", "fastq") \
                if rec.seq.startswith("GATGACGGTGT"))
count = SeqIO.write(primer_reads, "with_primer.fastq", "fastq")
print("Saved %i reads" % count)

That should find \(13819\) reads from SRR014849.fastq and save them to a new FASTQ file, with_primer.fastq.

Now suppose that instead you wanted to make a FASTQ file containing these reads but with the primer sequence removed? That’s just a small change as we can slice the SeqRecord (see Section [sec:SeqRecord-slicing]) to remove the first eleven letters (the length of our primer):

from Bio import SeqIO
trimmed_primer_reads = (rec[11:] for rec in \
                        SeqIO.parse("SRR020192.fastq", "fastq") \
                        if rec.seq.startswith("GATGACGGTGT"))
count = SeqIO.write(trimmed_primer_reads, "with_primer_trimmed.fastq", "fastq")
print("Saved %i reads" % count)

Again, that should pull out the \(13819\) reads from SRR020192.fastq, but this time strip off the first ten characters, and save them to another new FASTQ file, with_primer_trimmed.fastq.

Finally, suppose you want to create a new FASTQ file where these reads have their primer removed, but all the other reads are kept as they were? If we want to still use a generator expression, it is probably clearest to define our own trim function:

from Bio import SeqIO
def trim_primer(record, primer):
    if record.seq.startswith(primer):
        return record[len(primer):]
    else:
        return record

trimmed_reads = (trim_primer(record, "GATGACGGTGT") for record in \
                 SeqIO.parse("SRR020192.fastq", "fastq"))
count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
print("Saved %i reads" % count)

This takes longer, as this time the output file contains all \(41892\) reads. Again, we’re used a generator expression to avoid any memory problems. You could alternatively use a generator function rather than a generator expression.

from Bio import SeqIO
def trim_primers(records, primer):
    """Removes perfect primer sequences at start of reads.

    This is a generator function, the records argument should
    be a list or iterator returning SeqRecord objects.
    """
    len_primer = len(primer) #cache this for later
    for record in records:
        if record.seq.startswith(primer):
            yield record[len_primer:]
        else:
            yield record

original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
trimmed_reads = trim_primers(original_reads, "GATGACGGTGT")
count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
print("Saved %i reads" % count)

This form is more flexible if you want to do something more complicated where only some of the records are retained – as shown in the next example.

Trimming off adaptor sequences

This is essentially a simple extension to the previous example. We are going to going to pretend GATGACGGTGT is an adaptor sequence in some FASTQ formatted read data, again the SRR020192.fastq file from the NCBI (ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR020/SRR020192/SRR020192.fastq.gz).

This time however, we will look for the sequence anywhere in the reads, not just at the very beginning:

from Bio import SeqIO

def trim_adaptors(records, adaptor):
    """Trims perfect adaptor sequences.

    This is a generator function, the records argument should
    be a list or iterator returning SeqRecord objects.
    """
    len_adaptor = len(adaptor) #cache this for later
    for record in records:
        index = record.seq.find(adaptor)
        if index == -1:
            #adaptor not found, so won't trim
            yield record
        else:
            #trim off the adaptor
            yield record[index+len_adaptor:]

original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
trimmed_reads = trim_adaptors(original_reads, "GATGACGGTGT")
count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
print("Saved %i reads" % count)

Because we are using a FASTQ input file in this example, the SeqRecord objects have per-letter-annotation for the quality scores. By slicing the SeqRecord object the appropriate scores are used on the trimmed records, so we can output them as a FASTQ file too.

Compared to the output of the previous example where we only looked for a primer/adaptor at the start of each read, you may find some of the trimmed reads are quite short after trimming (e.g. if the adaptor was found in the middle rather than near the start). So, let’s add a minimum length requirement as well:

from Bio import SeqIO

def trim_adaptors(records, adaptor, min_len):
    """Trims perfect adaptor sequences, checks read length.

    This is a generator function, the records argument should
    be a list or iterator returning SeqRecord objects.
    """
    len_adaptor = len(adaptor) #cache this for later
    for record in records:
        len_record = len(record) #cache this for later
        if len(record) < min_len:
           #Too short to keep
           continue
        index = record.seq.find(adaptor)
        if index == -1:
            #adaptor not found, so won't trim
            yield record
        elif len_record - index - len_adaptor >= min_len:
            #after trimming this will still be long enough
            yield record[index+len_adaptor:]

original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
trimmed_reads = trim_adaptors(original_reads, "GATGACGGTGT", 100)
count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
print("Saved %i reads" % count)

By changing the format names, you could apply this to FASTA files instead. This code also could be extended to do a fuzzy match instead of an exact match (maybe using a pairwise alignment, or taking into account the read quality scores), but that will be much slower.

Converting FASTQ files

Back in Section [sec:SeqIO-conversion] we showed how to use Bio.SeqIO to convert between two file formats. Here we’ll go into a little more detail regarding FASTQ files which are used in second generation DNA sequencing. Please refer to Cock et al. (2009) @cock2010 for a longer description. FASTQ files store both the DNA sequence (as a string) and the associated read qualities.

PHRED scores (used in most FASTQ files, and also in QUAL files, ACE files and SFF files) have become a de facto standard for representing the probability of a sequencing error (here denoted by \(P_e\)) at a given base using a simple base ten log transformation:

\[Q_{\textrm{PHRED}} = - 10 \times \textrm{log}_{10} ( P_e )\]

This means a wrong read (\(P_e = 1\)) gets a PHRED quality of \(0\), while a very good read like \(P_e = 0.00001\) gets a PHRED quality of \(50\). While for raw sequencing data qualities higher than this are rare, with post processing such as read mapping or assembly, qualities of up to about \(90\) are possible (indeed, the MAQ tool allows for PHRED scores in the range 0 to 93 inclusive).

The FASTQ format has the potential to become a de facto standard for storing the letters and quality scores for a sequencing read in a single plain text file. The only fly in the ointment is that there are at least three versions of the FASTQ format which are incompatible and difficult to distinguish...

  1. The original Sanger FASTQ format uses PHRED qualities encoded with an ASCII offset of 33. The NCBI are using this format in their Short Read Archive. We call this the fastq (or fastq-sanger) format in Bio.SeqIO.
  2. Solexa (later bought by Illumina) introduced their own version using Solexa qualities encoded with an ASCII offset of 64. We call this the fastq-solexa format.
  3. Illumina pipeline 1.3 onwards produces FASTQ files with PHRED qualities (which is more consistent), but encoded with an ASCII offset of 64. We call this the fastq-illumina format.

The Solexa quality scores are defined using a different log transformation:

\[Q_{\textrm{Solexa}} = - 10 \times \textrm{log}_{10} \left( \frac{P_e}{1-P_e} \right)\]

Given Solexa/Illumina have now moved to using PHRED scores in version 1.3 of their pipeline, the Solexa quality scores will gradually fall out of use. If you equate the error estimates (\(P_e\)) these two equations allow conversion between the two scoring systems - and Biopython includes functions to do this in the Bio.SeqIO.QualityIO module, which are called if you use Bio.SeqIO to convert an old Solexa/Illumina file into a standard Sanger FASTQ file:

from Bio import SeqIO
SeqIO.convert("solexa.fastq", "fastq-solexa", "standard.fastq", "fastq")

If you want to convert a new Illumina 1.3+ FASTQ file, all that gets changed is the ASCII offset because although encoded differently the scores are all PHRED qualities:

from Bio import SeqIO
SeqIO.convert("illumina.fastq", "fastq-illumina", "standard.fastq", "fastq")

Note that using Bio.SeqIO.convert() like this is much faster than combining Bio.SeqIO.parse() and Bio.SeqIO.write() because optimised code is used for converting between FASTQ variants (and also for FASTQ to FASTA conversion).

For good quality reads, PHRED and Solexa scores are approximately equal, which means since both the fasta-solexa and fastq-illumina formats use an ASCII offset of 64 the files are almost the same. This was a deliberate design choice by Illumina, meaning applications expecting the old fasta-solexa style files will probably be OK using the newer fastq-illumina files (on good data). Of course, both variants are very different from the original FASTQ standard as used by Sanger, the NCBI, and elsewhere (format name fastq or fastq-sanger).

For more details, see the built in help (also online):

In [5]:
from Bio.SeqIO import QualityIO
help(QualityIO)

Help on module Bio.SeqIO.QualityIO in Bio.SeqIO:

NAME
    Bio.SeqIO.QualityIO - Bio.SeqIO support for the FASTQ and QUAL file formats.

DESCRIPTION
    Note that you are expected to use this code via the Bio.SeqIO interface, as
    shown below.

    The FASTQ file format is used frequently at the Wellcome Trust Sanger Institute
    to bundle a FASTA sequence and its PHRED quality data (integers between 0 and
    90).  Rather than using a single FASTQ file, often paired FASTA and QUAL files
    are used containing the sequence and the quality information separately.

    The PHRED software reads DNA sequencing trace files, calls bases, and
    assigns a non-negative quality value to each called base using a logged
    transformation of the error probability, Q = -10 log10( Pe ), for example::

        Pe = 1.0,         Q =  0
        Pe = 0.1,         Q = 10
        Pe = 0.01,        Q = 20
        ...
        Pe = 0.00000001,  Q = 80
        Pe = 0.000000001, Q = 90

    In typical raw sequence reads, the PHRED quality valuea will be from 0 to 40.
    In the QUAL format these quality values are held as space separated text in
    a FASTA like file format.  In the FASTQ format, each quality values is encoded
    with a single ASCI character using chr(Q+33), meaning zero maps to the
    character "!" and for example 80 maps to "q".  For the Sanger FASTQ standard
    the allowed range of PHRED scores is 0 to 93 inclusive. The sequences and
    quality are then stored in pairs in a FASTA like format.

    Unfortunately there is no official document describing the FASTQ file format,
    and worse, several related but different variants exist. For more details,
    please read this open access publication::

        The Sanger FASTQ file format for sequences with quality scores, and the
        Solexa/Illumina FASTQ variants.
        P.J.A.Cock (Biopython), C.J.Fields (BioPerl), N.Goto (BioRuby),
        M.L.Heuer (BioJava) and P.M. Rice (EMBOSS).
        Nucleic Acids Research 2010 38(6):1767-1771
        http://dx.doi.org/10.1093/nar/gkp1137

    The good news is that Roche 454 sequencers can output files in the QUAL format,
    and sensibly they use PHREP style scores like Sanger.  Converting a pair of
    FASTA and QUAL files into a Sanger style FASTQ file is easy. To extract QUAL
    files from a Roche 454 SFF binary file, use the Roche off instrument command
    line tool "sffinfo" with the -q or -qual argument.  You can extract a matching
    FASTA file using the -s or -seq argument instead.

    The bad news is that Solexa/Illumina did things differently - they have their
    own scoring system AND their own incompatible versions of the FASTQ format.
    Solexa/Illumina quality scores use Q = - 10 log10 ( Pe / (1-Pe) ), which can
    be negative.  PHRED scores and Solexa scores are NOT interchangeable (but a
    reasonable mapping can be achieved between them, and they are approximately
    equal for higher quality reads).

    Confusingly early Solexa pipelines produced a FASTQ like file but using their
    own score mapping and an ASCII offset of 64. To make things worse, for the
    Solexa/Illumina pipeline 1.3 onwards, they introduced a third variant of the
    FASTQ file format, this time using PHRED scores (which is more consistent) but
    with an ASCII offset of 64.

    i.e. There are at least THREE different and INCOMPATIBLE variants of the FASTQ
    file format: The original Sanger PHRED standard, and two from Solexa/Illumina.

    The good news is that as of CASAVA version 1.8, Illumina sequencers will
    produce FASTQ files using the standard Sanger encoding.

    You are expected to use this module via the Bio.SeqIO functions, with the
    following format names:

        - "qual" means simple quality files using PHRED scores (e.g. from Roche 454)
        - "fastq" means Sanger style FASTQ files using PHRED scores and an ASCII
          offset of 33 (e.g. from the NCBI Short Read Archive and Illumina 1.8+).
          These can potentially hold PHRED scores from 0 to 93.
        - "fastq-sanger" is an alias for "fastq".
        - "fastq-solexa" means old Solexa (and also very early Illumina) style FASTQ
          files, using Solexa scores with an ASCII offset 64. These can hold Solexa
          scores from -5 to 62.
        - "fastq-illumina" means newer Illumina 1.3 to 1.7 style FASTQ files, using
          PHRED scores but with an ASCII offset 64, allowing PHRED scores from 0
          to 62.

    We could potentially add support for "qual-solexa" meaning QUAL files which
    contain Solexa scores, but thus far there isn't any reason to use such files.

    For example, consider the following short FASTQ file::

        @EAS54_6_R1_2_1_413_324
        CCCTTCTTGTCTTCAGCGTTTCTCC
        +
        ;;3;;;;;;;;;;;;7;;;;;;;88
        @EAS54_6_R1_2_1_540_792
        TTGGCAGGCCAAGGCCGATGGATCA
        +
        ;;;;;;;;;;;7;;;;;-;;;3;83
        @EAS54_6_R1_2_1_443_348
        GTTGCTTCTGGCGTGGGTGGGGGGG
        +
        ;;;;;;;;;;;9;7;;.7;393333

    This contains three reads of length 25.  From the read length these were
    probably originally from an early Solexa/Illumina sequencer but this file
    follows the Sanger FASTQ convention (PHRED style qualities with an ASCII
    offet of 33).  This means we can parse this file using Bio.SeqIO using
    "fastq" as the format name:

    >>> from Bio import SeqIO
    >>> for record in SeqIO.parse("Quality/example.fastq", "fastq"):
    ...     print("%s %s" % (record.id, record.seq))
    EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
    EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
    EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG

    The qualities are held as a list of integers in each record's annotation:

    >>> print(record)
    ID: EAS54_6_R1_2_1_443_348
    Name: EAS54_6_R1_2_1_443_348
    Description: EAS54_6_R1_2_1_443_348
    Number of features: 0
    Per letter annotation for: phred_quality
    Seq('GTTGCTTCTGGCGTGGGTGGGGGGG', SingleLetterAlphabet())
    >>> print(record.letter_annotations["phred_quality"])
    [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]

    You can use the SeqRecord format method to show this in the QUAL format:

    >>> print(record.format("qual"))
    >EAS54_6_R1_2_1_443_348
    26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18
    24 18 18 18 18
    <BLANKLINE>

    Or go back to the FASTQ format, use "fastq" (or "fastq-sanger"):

    >>> print(record.format("fastq"))
    @EAS54_6_R1_2_1_443_348
    GTTGCTTCTGGCGTGGGTGGGGGGG
    +
    ;;;;;;;;;;;9;7;;.7;393333
    <BLANKLINE>

    Or, using the Illumina 1.3+ FASTQ encoding (PHRED values with an ASCII offset
    of 64):

    >>> print(record.format("fastq-illumina"))
    @EAS54_6_R1_2_1_443_348
    GTTGCTTCTGGCGTGGGTGGGGGGG
    +
    ZZZZZZZZZZZXZVZZMVZRXRRRR
    <BLANKLINE>

    You can also get Biopython to convert the scores and show a Solexa style
    FASTQ file:

    >>> print(record.format("fastq-solexa"))
    @EAS54_6_R1_2_1_443_348
    GTTGCTTCTGGCGTGGGTGGGGGGG
    +
    ZZZZZZZZZZZXZVZZMVZRXRRRR
    <BLANKLINE>

    Notice that this is actually the same output as above using "fastq-illumina"
    as the format! The reason for this is all these scores are high enough that
    the PHRED and Solexa scores are almost equal. The differences become apparent
    for poor quality reads. See the functions solexa_quality_from_phred and
    phred_quality_from_solexa for more details.

    If you wanted to trim your sequences (perhaps to remove low quality regions,
    or to remove a primer sequence), try slicing the SeqRecord objects.  e.g.

    >>> sub_rec = record[5:15]
    >>> print(sub_rec)
    ID: EAS54_6_R1_2_1_443_348
    Name: EAS54_6_R1_2_1_443_348
    Description: EAS54_6_R1_2_1_443_348
    Number of features: 0
    Per letter annotation for: phred_quality
    Seq('TTCTGGCGTG', SingleLetterAlphabet())
    >>> print(sub_rec.letter_annotations["phred_quality"])
    [26, 26, 26, 26, 26, 26, 24, 26, 22, 26]
    >>> print(sub_rec.format("fastq"))
    @EAS54_6_R1_2_1_443_348
    TTCTGGCGTG
    +
    ;;;;;;9;7;
    <BLANKLINE>

    If you wanted to, you could read in this FASTQ file, and save it as a QUAL file:

    >>> from Bio import SeqIO
    >>> record_iterator = SeqIO.parse("Quality/example.fastq", "fastq")
    >>> with open("Quality/temp.qual", "w") as out_handle:
    ...     SeqIO.write(record_iterator, out_handle, "qual")
    3

    You can of course read in a QUAL file, such as the one we just created:

    >>> from Bio import SeqIO
    >>> for record in SeqIO.parse("Quality/temp.qual", "qual"):
    ...     print("%s %s" % (record.id, record.seq))
    EAS54_6_R1_2_1_413_324 ?????????????????????????
    EAS54_6_R1_2_1_540_792 ?????????????????????????
    EAS54_6_R1_2_1_443_348 ?????????????????????????

    Notice that QUAL files don't have a proper sequence present!  But the quality
    information is there:

    >>> print(record)
    ID: EAS54_6_R1_2_1_443_348
    Name: EAS54_6_R1_2_1_443_348
    Description: EAS54_6_R1_2_1_443_348
    Number of features: 0
    Per letter annotation for: phred_quality
    UnknownSeq(25, alphabet = SingleLetterAlphabet(), character = '?')
    >>> print(record.letter_annotations["phred_quality"])
    [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]

    Just to keep things tidy, if you are following this example yourself, you can
    delete this temporary file now:

    >>> import os
    >>> os.remove("Quality/temp.qual")

    Sometimes you won't have a FASTQ file, but rather just a pair of FASTA and QUAL
    files.  Because the Bio.SeqIO system is designed for reading single files, you
    would have to read the two in separately and then combine the data.  However,
    since this is such a common thing to want to do, there is a helper iterator
    defined in this module that does this for you - PairedFastaQualIterator.

    Alternatively, if you have enough RAM to hold all the records in memory at once,
    then a simple dictionary approach would work:

    >>> from Bio import SeqIO
    >>> reads = SeqIO.to_dict(SeqIO.parse("Quality/example.fasta", "fasta"))
    >>> for rec in SeqIO.parse("Quality/example.qual", "qual"):
    ...     reads[rec.id].letter_annotations["phred_quality"]=rec.letter_annotations["phred_quality"]

    You can then access any record by its key, and get both the sequence and the
    quality scores.

    >>> print(reads["EAS54_6_R1_2_1_540_792"].format("fastq"))
    @EAS54_6_R1_2_1_540_792
    TTGGCAGGCCAAGGCCGATGGATCA
    +
    ;;;;;;;;;;;7;;;;;-;;;3;83
    <BLANKLINE>

    It is important that you explicitly tell Bio.SeqIO which FASTQ variant you are
    using ("fastq" or "fastq-sanger" for the Sanger standard using PHRED values,
    "fastq-solexa" for the original Solexa/Illumina variant, or "fastq-illumina"
    for the more recent variant), as this cannot be detected reliably
    automatically.

    To illustrate this problem, let's consider an artifical example:

    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import generic_dna
    >>> from Bio.SeqRecord import SeqRecord
    >>> test = SeqRecord(Seq("NACGTACGTA", generic_dna), id="Test",
    ... description="Made up!")
    >>> print(test.format("fasta"))
    >Test Made up!
    NACGTACGTA
    <BLANKLINE>
    >>> print(test.format("fastq"))
    Traceback (most recent call last):
     ...
    ValueError: No suitable quality scores found in letter_annotations of SeqRecord (id=Test).

    We created a sample SeqRecord, and can show it in FASTA format - but for QUAL
    or FASTQ format we need to provide some quality scores. These are held as a
    list of integers (one for each base) in the letter_annotations dictionary:

    >>> test.letter_annotations["phred_quality"] = [0, 1, 2, 3, 4, 5, 10, 20, 30, 40]
    >>> print(test.format("qual"))
    >Test Made up!
    0 1 2 3 4 5 10 20 30 40
    <BLANKLINE>
    >>> print(test.format("fastq"))
    @Test Made up!
    NACGTACGTA
    +
    !"#$%&+5?I
    <BLANKLINE>

    We can check this FASTQ encoding - the first PHRED quality was zero, and this
    mapped to a exclamation mark, while the final score was 40 and this mapped to
    the letter "I":

    >>> ord('!') - 33
    0
    >>> ord('I') - 33
    40
    >>> [ord(letter)-33 for letter in '!"#$%&+5?I']
    [0, 1, 2, 3, 4, 5, 10, 20, 30, 40]

    Similarly, we could produce an Illumina 1.3 to 1.7 style FASTQ file using PHRED
    scores with an offset of 64:

    >>> print(test.format("fastq-illumina"))
    @Test Made up!
    NACGTACGTA
    +
    @ABCDEJT^h
    <BLANKLINE>

    And we can check this too - the first PHRED score was zero, and this mapped to
    "@", while the final score was 40 and this mapped to "h":

    >>> ord("@") - 64
    0
    >>> ord("h") - 64
    40
    >>> [ord(letter)-64 for letter in "@ABCDEJT^h"]
    [0, 1, 2, 3, 4, 5, 10, 20, 30, 40]

    Notice how different the standard Sanger FASTQ and the Illumina 1.3 to 1.7 style
    FASTQ files look for the same data! Then we have the older Solexa/Illumina
    format to consider which encodes Solexa scores instead of PHRED scores.

    First let's see what Biopython says if we convert the PHRED scores into Solexa
    scores (rounding to one decimal place):

    >>> for q in [0, 1, 2, 3, 4, 5, 10, 20, 30, 40]:
    ...     print("PHRED %i maps to Solexa %0.1f" % (q, solexa_quality_from_phred(q)))
    PHRED 0 maps to Solexa -5.0
    PHRED 1 maps to Solexa -5.0
    PHRED 2 maps to Solexa -2.3
    PHRED 3 maps to Solexa -0.0
    PHRED 4 maps to Solexa 1.8
    PHRED 5 maps to Solexa 3.3
    PHRED 10 maps to Solexa 9.5
    PHRED 20 maps to Solexa 20.0
    PHRED 30 maps to Solexa 30.0
    PHRED 40 maps to Solexa 40.0

    Now here is the record using the old Solexa style FASTQ file:

    >>> print(test.format("fastq-solexa"))
    @Test Made up!
    NACGTACGTA
    +
    ;;>@BCJT^h
    <BLANKLINE>

    Again, this is using an ASCII offset of 64, so we can check the Solexa scores:

    >>> [ord(letter)-64 for letter in ";;>@BCJT^h"]
    [-5, -5, -2, 0, 2, 3, 10, 20, 30, 40]

    This explains why the last few letters of this FASTQ output matched that using
    the Illumina 1.3 to 1.7 format - high quality PHRED scores and Solexa scores
    are approximately equal.

CLASSES
    Bio.SeqIO.Interfaces.SequentialSequenceWriter(Bio.SeqIO.Interfaces.SequenceWriter)
        FastqIlluminaWriter
        FastqPhredWriter
        FastqSolexaWriter
        QualPhredWriter

    class FastqIlluminaWriter(Bio.SeqIO.Interfaces.SequentialSequenceWriter)
     |  Write Illumina 1.3+ FASTQ format files (with PHRED quality scores).
     |
     |  This outputs FASTQ files like those from the Solexa/Illumina 1.3+ pipeline,
     |  using PHRED scores and an ASCII offset of 64. Note these files are NOT
     |  compatible with the standard Sanger style PHRED FASTQ files which use an
     |  ASCII offset of 32.
     |
     |  Although you can use this class directly, you are strongly encouraged to
     |  use the Bio.SeqIO.write() function with format name "fastq-illumina"
     |  instead. This code is also called if you use the .format("fastq-illumina")
     |  method of a SeqRecord. For example,
     |
     |  >>> from Bio import SeqIO
     |  >>> record = SeqIO.read("Quality/sanger_faked.fastq", "fastq-sanger")
     |  >>> print(record.format("fastq-illumina"))
     |  @Test PHRED qualities from 40 to 0 inclusive
     |  ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
     |  +
     |  hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@
     |  <BLANKLINE>
     |
     |  Note that Illumina FASTQ files have an upper limit of PHRED quality 62, which is
     |  encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a
     |  warning is issued.
     |
     |  Method resolution order:
     |      FastqIlluminaWriter
     |      Bio.SeqIO.Interfaces.SequentialSequenceWriter
     |      Bio.SeqIO.Interfaces.SequenceWriter
     |      builtins.object
     |
     |  Methods defined here:
     |
     |  write_record(self, record)
     |      Write a single FASTQ record to the file.
     |
     |  ----------------------------------------------------------------------
     |  Methods inherited from Bio.SeqIO.Interfaces.SequentialSequenceWriter:
     |
     |  __init__(self, handle)
     |      Creates the writer object.
     |
     |      Use the method write_file() to actually record your sequence records.
     |
     |  write_file(self, records)
     |      Use this to write an entire file containing the given records.
     |
     |      records - A list or iterator returning SeqRecord objects
     |
     |      This method can only be called once.  Returns the number of records
     |      written.
     |
     |  write_footer(self)
     |
     |  write_header(self)
     |
     |  write_records(self, records)
     |      Write multiple record to the output file.
     |
     |      records - A list or iterator returning SeqRecord objects
     |
     |      Once you have called write_header() you can call write_record()
     |      and/or write_records() as many times as needed.  Then call
     |      write_footer() and close().
     |
     |      Returns the number of records written.
     |
     |  ----------------------------------------------------------------------
     |  Methods inherited from Bio.SeqIO.Interfaces.SequenceWriter:
     |
     |  clean(self, text)
     |      Use this to avoid getting newlines in the output.
     |
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from Bio.SeqIO.Interfaces.SequenceWriter:
     |
     |  __dict__
     |      dictionary for instance variables (if defined)
     |
     |  __weakref__
     |      list of weak references to the object (if defined)

    class FastqPhredWriter(Bio.SeqIO.Interfaces.SequentialSequenceWriter)
     |  Class to write standard FASTQ format files (using PHRED quality scores).
     |
     |  Although you can use this class directly, you are strongly encouraged
     |  to use the Bio.SeqIO.write() function instead via the format name "fastq"
     |  or the alias "fastq-sanger".  For example, this code reads in a standard
     |  Sanger style FASTQ file (using PHRED scores) and re-saves it as another
     |  Sanger style FASTQ file:
     |
     |  >>> from Bio import SeqIO
     |  >>> record_iterator = SeqIO.parse("Quality/example.fastq", "fastq")
     |  >>> with open("Quality/temp.fastq", "w") as out_handle:
     |  ...     SeqIO.write(record_iterator, out_handle, "fastq")
     |  3
     |
     |  You might want to do this if the original file included extra line breaks,
     |  which while valid may not be supported by all tools.  The output file from
     |  Biopython will have each sequence on a single line, and each quality
     |  string on a single line (which is considered desirable for maximum
     |  compatibility).
     |
     |  In this next example, an old style Solexa/Illumina FASTQ file (using Solexa
     |  quality scores) is converted into a standard Sanger style FASTQ file using
     |  PHRED qualities:
     |
     |  >>> from Bio import SeqIO
     |  >>> record_iterator = SeqIO.parse("Quality/solexa_example.fastq", "fastq-solexa")
     |  >>> with open("Quality/temp.fastq", "w") as out_handle:
     |  ...     SeqIO.write(record_iterator, out_handle, "fastq")
     |  5
     |
     |  This code is also called if you use the .format("fastq") method of a
     |  SeqRecord, or .format("fastq-sanger") if you prefer that alias.
     |
     |  Note that Sanger FASTQ files have an upper limit of PHRED quality 93, which is
     |  encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a
     |  warning is issued.
     |
     |  P.S. To avoid cluttering up your working directory, you can delete this
     |  temporary file now:
     |
     |  >>> import os
     |  >>> os.remove("Quality/temp.fastq")
     |
     |  Method resolution order:
     |      FastqPhredWriter
     |      Bio.SeqIO.Interfaces.SequentialSequenceWriter
     |      Bio.SeqIO.Interfaces.SequenceWriter
     |      builtins.object
     |
     |  Methods defined here:
     |
     |  write_record(self, record)
     |      Write a single FASTQ record to the file.
     |
     |  ----------------------------------------------------------------------
     |  Methods inherited from Bio.SeqIO.Interfaces.SequentialSequenceWriter:
     |
     |  __init__(self, handle)
     |      Creates the writer object.
     |
     |      Use the method write_file() to actually record your sequence records.
     |
     |  write_file(self, records)
     |      Use this to write an entire file containing the given records.
     |
     |      records - A list or iterator returning SeqRecord objects
     |
     |      This method can only be called once.  Returns the number of records
     |      written.
     |
     |  write_footer(self)
     |
     |  write_header(self)
     |
     |  write_records(self, records)
     |      Write multiple record to the output file.
     |
     |      records - A list or iterator returning SeqRecord objects
     |
     |      Once you have called write_header() you can call write_record()
     |      and/or write_records() as many times as needed.  Then call
     |      write_footer() and close().
     |
     |      Returns the number of records written.
     |
     |  ----------------------------------------------------------------------
     |  Methods inherited from Bio.SeqIO.Interfaces.SequenceWriter:
     |
     |  clean(self, text)
     |      Use this to avoid getting newlines in the output.
     |
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from Bio.SeqIO.Interfaces.SequenceWriter:
     |
     |  __dict__
     |      dictionary for instance variables (if defined)
     |
     |  __weakref__
     |      list of weak references to the object (if defined)

    class FastqSolexaWriter(Bio.SeqIO.Interfaces.SequentialSequenceWriter)
     |  Write old style Solexa/Illumina FASTQ format files (with Solexa qualities).
     |
     |  This outputs FASTQ files like those from the early Solexa/Illumina
     |  pipeline, using Solexa scores and an ASCII offset of 64. These are
     |  NOT compatible with the standard Sanger style PHRED FASTQ files.
     |
     |  If your records contain a "solexa_quality" entry under letter_annotations,
     |  this is used, otherwise any "phred_quality" entry will be used after
     |  conversion using the solexa_quality_from_phred function. If neither style
     |  of quality scores are present, an exception is raised.
     |
     |  Although you can use this class directly, you are strongly encouraged
     |  to use the Bio.SeqIO.write() function instead.  For example, this code
     |  reads in a FASTQ file and re-saves it as another FASTQ file:
     |
     |  >>> from Bio import SeqIO
     |  >>> record_iterator = SeqIO.parse("Quality/solexa_example.fastq", "fastq-solexa")
     |  >>> with open("Quality/temp.fastq", "w") as out_handle:
     |  ...     SeqIO.write(record_iterator, out_handle, "fastq-solexa")
     |  5
     |
     |  You might want to do this if the original file included extra line breaks,
     |  which (while valid) may not be supported by all tools.  The output file
     |  from Biopython will have each sequence on a single line, and each quality
     |  string on a single line (which is considered desirable for maximum
     |  compatibility).
     |
     |  This code is also called if you use the .format("fastq-solexa") method of
     |  a SeqRecord. For example,
     |
     |  >>> record = SeqIO.read("Quality/sanger_faked.fastq", "fastq-sanger")
     |  >>> print(record.format("fastq-solexa"))
     |  @Test PHRED qualities from 40 to 0 inclusive
     |  ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
     |  +
     |  hgfedcba`_^]\[ZYXWVUTSRQPONMLKJHGFECB@>;;
     |  <BLANKLINE>
     |
     |  Note that Solexa FASTQ files have an upper limit of Solexa quality 62, which is
     |  encoded as ASCII 126, the tilde.  If your quality scores must be truncated to fit,
     |  a warning is issued.
     |
     |  P.S. Don't forget to delete the temp file if you don't need it anymore:
     |
     |  >>> import os
     |  >>> os.remove("Quality/temp.fastq")
     |
     |  Method resolution order:
     |      FastqSolexaWriter
     |      Bio.SeqIO.Interfaces.SequentialSequenceWriter
     |      Bio.SeqIO.Interfaces.SequenceWriter
     |      builtins.object
     |
     |  Methods defined here:
     |
     |  write_record(self, record)
     |      Write a single FASTQ record to the file.
     |
     |  ----------------------------------------------------------------------
     |  Methods inherited from Bio.SeqIO.Interfaces.SequentialSequenceWriter:
     |
     |  __init__(self, handle)
     |      Creates the writer object.
     |
     |      Use the method write_file() to actually record your sequence records.
     |
     |  write_file(self, records)
     |      Use this to write an entire file containing the given records.
     |
     |      records - A list or iterator returning SeqRecord objects
     |
     |      This method can only be called once.  Returns the number of records
     |      written.
     |
     |  write_footer(self)
     |
     |  write_header(self)
     |
     |  write_records(self, records)
     |      Write multiple record to the output file.
     |
     |      records - A list or iterator returning SeqRecord objects
     |
     |      Once you have called write_header() you can call write_record()
     |      and/or write_records() as many times as needed.  Then call
     |      write_footer() and close().
     |
     |      Returns the number of records written.
     |
     |  ----------------------------------------------------------------------
     |  Methods inherited from Bio.SeqIO.Interfaces.SequenceWriter:
     |
     |  clean(self, text)
     |      Use this to avoid getting newlines in the output.
     |
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from Bio.SeqIO.Interfaces.SequenceWriter:
     |
     |  __dict__
     |      dictionary for instance variables (if defined)
     |
     |  __weakref__
     |      list of weak references to the object (if defined)

    class QualPhredWriter(Bio.SeqIO.Interfaces.SequentialSequenceWriter)
     |  Class to write QUAL format files (using PHRED quality scores).
     |
     |  Although you can use this class directly, you are strongly encouraged
     |  to use the Bio.SeqIO.write() function instead.  For example, this code
     |  reads in a FASTQ file and saves the quality scores into a QUAL file:
     |
     |  >>> from Bio import SeqIO
     |  >>> record_iterator = SeqIO.parse("Quality/example.fastq", "fastq")
     |  >>> with open("Quality/temp.qual", "w") as out_handle:
     |  ...     SeqIO.write(record_iterator, out_handle, "qual")
     |  3
     |
     |  This code is also called if you use the .format("qual") method of a
     |  SeqRecord.
     |
     |  P.S. Don't forget to clean up the temp file if you don't need it anymore:
     |
     |  >>> import os
     |  >>> os.remove("Quality/temp.qual")
     |
     |  Method resolution order:
     |      QualPhredWriter
     |      Bio.SeqIO.Interfaces.SequentialSequenceWriter
     |      Bio.SeqIO.Interfaces.SequenceWriter
     |      builtins.object
     |
     |  Methods defined here:
     |
     |  __init__(self, handle, wrap=60, record2title=None)
     |      Create a QUAL writer.
     |
     |      Arguments:
     |       - handle - Handle to an output file, e.g. as returned
     |                  by open(filename, "w")
     |       - wrap   - Optional line length used to wrap sequence lines.
     |                  Defaults to wrapping the sequence at 60 characters
     |                  Use zero (or None) for no wrapping, giving a single
     |                  long line for the sequence.
     |       - record2title - Optional function to return the text to be
     |                  used for the title line of each record.  By default
     |                  a combination of the record.id and record.description
     |                  is used.  If the record.description starts with the
     |                  record.id, then just the record.description is used.
     |
     |      The record2title argument is present for consistency with the
     |      Bio.SeqIO.FastaIO writer class.
     |
     |  write_record(self, record)
     |      Write a single QUAL record to the file.
     |
     |  ----------------------------------------------------------------------
     |  Methods inherited from Bio.SeqIO.Interfaces.SequentialSequenceWriter:
     |
     |  write_file(self, records)
     |      Use this to write an entire file containing the given records.
     |
     |      records - A list or iterator returning SeqRecord objects
     |
     |      This method can only be called once.  Returns the number of records
     |      written.
     |
     |  write_footer(self)
     |
     |  write_header(self)
     |
     |  write_records(self, records)
     |      Write multiple record to the output file.
     |
     |      records - A list or iterator returning SeqRecord objects
     |
     |      Once you have called write_header() you can call write_record()
     |      and/or write_records() as many times as needed.  Then call
     |      write_footer() and close().
     |
     |      Returns the number of records written.
     |
     |  ----------------------------------------------------------------------
     |  Methods inherited from Bio.SeqIO.Interfaces.SequenceWriter:
     |
     |  clean(self, text)
     |      Use this to avoid getting newlines in the output.
     |
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from Bio.SeqIO.Interfaces.SequenceWriter:
     |
     |  __dict__
     |      dictionary for instance variables (if defined)
     |
     |  __weakref__
     |      list of weak references to the object (if defined)

FUNCTIONS
    FastqGeneralIterator(handle)
        Iterate over Fastq records as string tuples (not as SeqRecord objects).

        This code does not try to interpret the quality string numerically.  It
        just returns tuples of the title, sequence and quality as strings.  For
        the sequence and quality, any whitespace (such as new lines) is removed.

        Our SeqRecord based FASTQ iterators call this function internally, and then
        turn the strings into a SeqRecord objects, mapping the quality string into
        a list of numerical scores.  If you want to do a custom quality mapping,
        then you might consider calling this function directly.

        For parsing FASTQ files, the title string from the "@" line at the start
        of each record can optionally be omitted on the "+" lines.  If it is
        repeated, it must be identical.

        The sequence string and the quality string can optionally be split over
        multiple lines, although several sources discourage this.  In comparison,
        for the FASTA file format line breaks between 60 and 80 characters are
        the norm.

        **WARNING** - Because the "@" character can appear in the quality string,
        this can cause problems as this is also the marker for the start of
        a new sequence.  In fact, the "+" sign can also appear as well.  Some
        sources recommended having no line breaks in the  quality to avoid this,
        but even that is not enough, consider this example::

            @071113_EAS56_0053:1:1:998:236
            TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA
            +071113_EAS56_0053:1:1:998:236
            IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III
            @071113_EAS56_0053:1:1:182:712
            ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG
            +
            @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+
            @071113_EAS56_0053:1:1:153:10
            TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT
            +
            IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6
            @071113_EAS56_0053:1:3:990:501
            TGGGAGGTTTTATGTGGA
            AAGCAGCAATGTACAAGA
            +
            IIIIIII.IIIIII1@44
            @-7.%<&+/$/%4(++(%

        This is four PHRED encoded FASTQ entries originally from an NCBI source
        (given the read length of 36, these are probably Solexa Illumina reads where
        the quality has been mapped onto the PHRED values).

        This example has been edited to illustrate some of the nasty things allowed
        in the FASTQ format.  Firstly, on the "+" lines most but not all of the
        (redundant) identifiers are omitted.  In real files it is likely that all or
        none of these extra identifiers will be present.

        Secondly, while the first three sequences have been shown without line
        breaks, the last has been split over multiple lines.  In real files any line
        breaks are likely to be consistent.

        Thirdly, some of the quality string lines start with an "@" character.  For
        the second record this is unavoidable.  However for the fourth sequence this
        only happens because its quality string is split over two lines.  A naive
        parser could wrongly treat any line starting with an "@" as the beginning of
        a new sequence!  This code copes with this possible ambiguity by keeping
        track of the length of the sequence which gives the expected length of the
        quality string.

        Using this tricky example file as input, this short bit of code demonstrates
        what this parsing function would return:

        >>> with open("Quality/tricky.fastq", "rU") as handle:
        ...     for (title, sequence, quality) in FastqGeneralIterator(handle):
        ...         print(title)
        ...         print("%s %s" % (sequence, quality))
        ...
        071113_EAS56_0053:1:1:998:236
        TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III
        071113_EAS56_0053:1:1:182:712
        ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+
        071113_EAS56_0053:1:1:153:10
        TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6
        071113_EAS56_0053:1:3:990:501
        TGGGAGGTTTTATGTGGAAAGCAGCAATGTACAAGA IIIIIII.IIIIII1@44@-7.%<&+/$/%4(++(%

        Finally we note that some sources state that the quality string should
        start with "!" (which using the PHRED mapping means the first letter always
        has a quality score of zero).  This rather restrictive rule is not widely
        observed, so is therefore ignored here.  One plus point about this "!" rule
        is that (provided there are no line breaks in the quality sequence) it
        would prevent the above problem with the "@" character.

    FastqIlluminaIterator(handle, alphabet=SingleLetterAlphabet(), title2ids=None)
        Parse Illumina 1.3 to 1.7 FASTQ like files (which differ in the quality mapping).

        The optional arguments are the same as those for the FastqPhredIterator.

        For each sequence in Illumina 1.3+ FASTQ files there is a matching string
        encoding PHRED integer qualities using ASCII values with an offset of 64.

        >>> from Bio import SeqIO
        >>> record = SeqIO.read("Quality/illumina_faked.fastq", "fastq-illumina")
        >>> print("%s %s" % (record.id, record.seq))
        Test ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
        >>> max(record.letter_annotations["phred_quality"])
        40
        >>> min(record.letter_annotations["phred_quality"])
        0

        NOTE - Older versions of the Solexa/Illumina pipeline encoded Solexa scores
        with an ASCII offset of 64. They are approximately equal but only for high
        quality reads. If you have an old Solexa/Illumina file with negative
        Solexa scores, and try and read this as an Illumina 1.3+ file it will fail:

        >>> record2 = SeqIO.read("Quality/solexa_faked.fastq", "fastq-illumina")
        Traceback (most recent call last):
           ...
        ValueError: Invalid character in quality string

        NOTE - True Sanger style FASTQ files use PHRED scores with an offset of 33.

    FastqPhredIterator(handle, alphabet=SingleLetterAlphabet(), title2ids=None)
        Generator function to iterate over FASTQ records (as SeqRecord objects).

            - handle - input file
            - alphabet - optional alphabet
            - title2ids - A function that, when given the title line from the FASTQ
              file (without the beginning >), will return the id, name and
              description (in that order) for the record as a tuple of
              strings.  If this is not given, then the entire title line
              will be used as the description, and the first word as the
              id and name.

        Note that use of title2ids matches that of Bio.SeqIO.FastaIO.

        For each sequence in a (Sanger style) FASTQ file there is a matching string
        encoding the PHRED qualities (integers between 0 and about 90) using ASCII
        values with an offset of 33.

        For example, consider a file containing three short reads::

            @EAS54_6_R1_2_1_413_324
            CCCTTCTTGTCTTCAGCGTTTCTCC
            +
            ;;3;;;;;;;;;;;;7;;;;;;;88
            @EAS54_6_R1_2_1_540_792
            TTGGCAGGCCAAGGCCGATGGATCA
            +
            ;;;;;;;;;;;7;;;;;-;;;3;83
            @EAS54_6_R1_2_1_443_348
            GTTGCTTCTGGCGTGGGTGGGGGGG
            +
            ;;;;;;;;;;;9;7;;.7;393333

        For each sequence (e.g. "CCCTTCTTGTCTTCAGCGTTTCTCC") there is a matching
        string encoding the PHRED qualities using a ASCII values with an offset of
        33 (e.g. ";;3;;;;;;;;;;;;7;;;;;;;88").

        Using this module directly you might run:

        >>> with open("Quality/example.fastq", "rU") as handle:
        ...     for record in FastqPhredIterator(handle):
        ...         print("%s %s" % (record.id, record.seq))
        EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
        EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
        EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG

        Typically however, you would call this via Bio.SeqIO instead with "fastq"
        (or "fastq-sanger") as the format:

        >>> from Bio import SeqIO
        >>> with open("Quality/example.fastq", "rU") as handle:
        ...     for record in SeqIO.parse(handle, "fastq"):
        ...         print("%s %s" % (record.id, record.seq))
        EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
        EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
        EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG

        If you want to look at the qualities, they are record in each record's
        per-letter-annotation dictionary as a simple list of integers:

        >>> print(record.letter_annotations["phred_quality"])
        [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]

    FastqSolexaIterator(handle, alphabet=SingleLetterAlphabet(), title2ids=None)
        Parsing old Solexa/Illumina FASTQ like files (which differ in the quality mapping).

        The optional arguments are the same as those for the FastqPhredIterator.

        For each sequence in Solexa/Illumina FASTQ files there is a matching string
        encoding the Solexa integer qualities using ASCII values with an offset
        of 64.  Solexa scores are scaled differently to PHRED scores, and Biopython
        will NOT perform any automatic conversion when loading.

        NOTE - This file format is used by the OLD versions of the Solexa/Illumina
        pipeline. See also the FastqIlluminaIterator function for the NEW version.

        For example, consider a file containing these five records::

            @SLXA-B3_649_FC8437_R1_1_1_610_79
            GATGTGCAATACCTTTGTAGAGGAA
            +SLXA-B3_649_FC8437_R1_1_1_610_79
            YYYYYYYYYYYYYYYYYYWYWYYSU
            @SLXA-B3_649_FC8437_R1_1_1_397_389
            GGTTTGAGAAAGAGAAATGAGATAA
            +SLXA-B3_649_FC8437_R1_1_1_397_389
            YYYYYYYYYWYYYYWWYYYWYWYWW
            @SLXA-B3_649_FC8437_R1_1_1_850_123
            GAGGGTGTTGATCATGATGATGGCG
            +SLXA-B3_649_FC8437_R1_1_1_850_123
            YYYYYYYYYYYYYWYYWYYSYYYSY
            @SLXA-B3_649_FC8437_R1_1_1_362_549
            GGAAACAAAGTTTTTCTCAACATAG
            +SLXA-B3_649_FC8437_R1_1_1_362_549
            YYYYYYYYYYYYYYYYYYWWWWYWY
            @SLXA-B3_649_FC8437_R1_1_1_183_714
            GTATTATTTAATGGCATACACTCAA
            +SLXA-B3_649_FC8437_R1_1_1_183_714
            YYYYYYYYYYWYYYYWYWWUWWWQQ

        Using this module directly you might run:

        >>> with open("Quality/solexa_example.fastq", "rU") as handle:
        ...     for record in FastqSolexaIterator(handle):
        ...         print("%s %s" % (record.id, record.seq))
        SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA
        SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA
        SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG
        SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG
        SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA

        Typically however, you would call this via Bio.SeqIO instead with
        "fastq-solexa" as the format:

        >>> from Bio import SeqIO
        >>> with open("Quality/solexa_example.fastq", "rU") as handle:
        ...     for record in SeqIO.parse(handle, "fastq-solexa"):
        ...         print("%s %s" % (record.id, record.seq))
        SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA
        SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA
        SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG
        SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG
        SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA

        If you want to look at the qualities, they are recorded in each record's
        per-letter-annotation dictionary as a simple list of integers:

        >>> print(record.letter_annotations["solexa_quality"])
        [25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 23, 25, 25, 25, 25, 23, 25, 23, 23, 21, 23, 23, 23, 17, 17]

        These scores aren't very good, but they are high enough that they map
        almost exactly onto PHRED scores:

        >>> print("%0.2f" % phred_quality_from_solexa(25))
        25.01

        Let's look at faked example read which is even worse, where there are
        more noticeable differences between the Solexa and PHRED scores::

             @slxa_0001_1_0001_01
             ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
             +slxa_0001_1_0001_01
             hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;

        Again, you would typically use Bio.SeqIO to read this file in (rather than
        calling the Bio.SeqIO.QualtityIO module directly).  Most FASTQ files will
        contain thousands of reads, so you would normally use Bio.SeqIO.parse()
        as shown above.  This example has only as one entry, so instead we can
        use the Bio.SeqIO.read() function:

        >>> from Bio import SeqIO
        >>> with open("Quality/solexa_faked.fastq", "rU") as handle:
        ...     record = SeqIO.read(handle, "fastq-solexa")
        >>> print("%s %s" % (record.id, record.seq))
        slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
        >>> print(record.letter_annotations["solexa_quality"])
        [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5]

        These quality scores are so low that when converted from the Solexa scheme
        into PHRED scores they look quite different:

        >>> print("%0.2f" % phred_quality_from_solexa(-1))
        2.54
        >>> print("%0.2f" % phred_quality_from_solexa(-5))
        1.19

        Note you can use the Bio.SeqIO.write() function or the SeqRecord's format
        method to output the record(s):

        >>> print(record.format("fastq-solexa"))
        @slxa_0001_1_0001_01
        ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
        +
        hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;
        <BLANKLINE>

        Note this output is slightly different from the input file as Biopython
        has left out the optional repetition of the sequence identifier on the "+"
        line.  If you want the to use PHRED scores, use "fastq" or "qual" as the
        output format instead, and Biopython will do the conversion for you:

        >>> print(record.format("fastq"))
        @slxa_0001_1_0001_01
        ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
        +
        IHGFEDCBA@?>=<;:9876543210/.-,++*)('&&%%$$##""
        <BLANKLINE>

        >>> print(record.format("qual"))
        >slxa_0001_1_0001_01
        40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21
        20 19 18 17 16 15 14 13 12 11 10 10 9 8 7 6 5 5 4 4 3 3 2 2
        1 1
        <BLANKLINE>

        As shown above, the poor quality Solexa reads have been mapped to the
        equivalent PHRED score (e.g. -5 to 1 as shown earlier).

    PairedFastaQualIterator(fasta_handle, qual_handle, alphabet=SingleLetterAlphabet(), title2ids=None)
        Iterate over matched FASTA and QUAL files as SeqRecord objects.

        For example, consider this short QUAL file with PHRED quality scores::

            >EAS54_6_R1_2_1_413_324
            26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26
            26 26 26 23 23
            >EAS54_6_R1_2_1_540_792
            26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26
            26 18 26 23 18
            >EAS54_6_R1_2_1_443_348
            26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18
            24 18 18 18 18

        And a matching FASTA file::

            >EAS54_6_R1_2_1_413_324
            CCCTTCTTGTCTTCAGCGTTTCTCC
            >EAS54_6_R1_2_1_540_792
            TTGGCAGGCCAAGGCCGATGGATCA
            >EAS54_6_R1_2_1_443_348
            GTTGCTTCTGGCGTGGGTGGGGGGG

        You can parse these separately using Bio.SeqIO with the "qual" and
        "fasta" formats, but then you'll get a group of SeqRecord objects with
        no sequence, and a matching group with the sequence but not the
        qualities.  Because it only deals with one input file handle, Bio.SeqIO
        can't be used to read the two files together - but this function can!
        For example,

        >>> with open("Quality/example.fasta", "rU") as f:
        ...     with open("Quality/example.qual", "rU") as q:
        ...         for record in PairedFastaQualIterator(f, q):
        ...             print("%s %s" % (record.id, record.seq))
        ...
        EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
        EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
        EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG

        As with the FASTQ or QUAL parsers, if you want to look at the qualities,
        they are in each record's per-letter-annotation dictionary as a simple
        list of integers:

        >>> print(record.letter_annotations["phred_quality"])
        [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]

        If you have access to data as a FASTQ format file, using that directly
        would be simpler and more straight forward.  Note that you can easily use
        this function to convert paired FASTA and QUAL files into FASTQ files:

        >>> from Bio import SeqIO
        >>> with open("Quality/example.fasta", "rU") as f:
        ...     with open("Quality/example.qual", "rU") as q:
        ...         SeqIO.write(PairedFastaQualIterator(f, q), "Quality/temp.fastq", "fastq")
        ...
        3

        And don't forget to clean up the temp file if you don't need it anymore:

        >>> import os
        >>> os.remove("Quality/temp.fastq")

    QualPhredIterator(handle, alphabet=SingleLetterAlphabet(), title2ids=None)
        For QUAL files which include PHRED quality scores, but no sequence.

        For example, consider this short QUAL file::

            >EAS54_6_R1_2_1_413_324
            26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26
            26 26 26 23 23
            >EAS54_6_R1_2_1_540_792
            26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26
            26 18 26 23 18
            >EAS54_6_R1_2_1_443_348
            26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18
            24 18 18 18 18

        Using this module directly you might run:

        >>> with open("Quality/example.qual", "rU") as handle:
        ...     for record in QualPhredIterator(handle):
        ...         print("%s %s" % (record.id, record.seq))
        EAS54_6_R1_2_1_413_324 ?????????????????????????
        EAS54_6_R1_2_1_540_792 ?????????????????????????
        EAS54_6_R1_2_1_443_348 ?????????????????????????

        Typically however, you would call this via Bio.SeqIO instead with "qual"
        as the format:

        >>> from Bio import SeqIO
        >>> with open("Quality/example.qual", "rU") as handle:
        ...     for record in SeqIO.parse(handle, "qual"):
        ...         print("%s %s" % (record.id, record.seq))
        EAS54_6_R1_2_1_413_324 ?????????????????????????
        EAS54_6_R1_2_1_540_792 ?????????????????????????
        EAS54_6_R1_2_1_443_348 ?????????????????????????

        Becase QUAL files don't contain the sequence string itself, the seq
        property is set to an UnknownSeq object.  As no alphabet was given, this
        has defaulted to a generic single letter alphabet and the character "?"
        used.

        By specifying a nucleotide alphabet, "N" is used instead:

        >>> from Bio import SeqIO
        >>> from Bio.Alphabet import generic_dna
        >>> with open("Quality/example.qual", "rU") as handle:
        ...     for record in SeqIO.parse(handle, "qual", alphabet=generic_dna):
        ...         print("%s %s" % (record.id, record.seq))
        EAS54_6_R1_2_1_413_324 NNNNNNNNNNNNNNNNNNNNNNNNN
        EAS54_6_R1_2_1_540_792 NNNNNNNNNNNNNNNNNNNNNNNNN
        EAS54_6_R1_2_1_443_348 NNNNNNNNNNNNNNNNNNNNNNNNN

        However, the quality scores themselves are available as a list of integers
        in each record's per-letter-annotation:

        >>> print(record.letter_annotations["phred_quality"])
        [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]

        You can still slice one of these SeqRecord objects with an UnknownSeq:

        >>> sub_record = record[5:10]
        >>> print("%s %s" % (sub_record.id, sub_record.letter_annotations["phred_quality"]))
        EAS54_6_R1_2_1_443_348 [26, 26, 26, 26, 26]

        As of Biopython 1.59, this parser will accept files with negatives quality
        scores but will replace them with the lowest possible PHRED score of zero.
        This will trigger a warning, previously it raised a ValueError exception.

    log(...)
        log(x[, base])

        Return the logarithm of x to the given base.
        If the base not specified, returns the natural logarithm (base e) of x.

    phred_quality_from_solexa(solexa_quality)
        Convert a Solexa quality (which can be negative) to a PHRED quality.

        PHRED and Solexa quality scores are both log transformations of a
        probality of error (high score = low probability of error). This function
        takes a Solexa score, transforms it back to a probability of error, and
        then re-expresses it as a PHRED score. This assumes the error estimates
        are equivalent.

        The underlying formulas are given in the documentation for the sister
        function solexa_quality_from_phred, in this case the operation is::

            phred_quality = 10*log(10**(solexa_quality/10.0) + 1, 10)

        This will return a floating point number, it is up to you to round this to
        the nearest integer if appropriate.  e.g.

        >>> print("%0.2f" % round(phred_quality_from_solexa(80), 2))
        80.00
        >>> print("%0.2f" % round(phred_quality_from_solexa(20), 2))
        20.04
        >>> print("%0.2f" % round(phred_quality_from_solexa(10), 2))
        10.41
        >>> print("%0.2f" % round(phred_quality_from_solexa(0), 2))
        3.01
        >>> print("%0.2f" % round(phred_quality_from_solexa(-5), 2))
        1.19

        Note that a solexa_quality less then -5 is not expected, will trigger a
        warning, but will still be converted as per the logarithmic mapping
        (giving a number between 0 and 1.19 back).

        As a special case where None is used for a "missing value", None is
        returned:

        >>> print(phred_quality_from_solexa(None))
        None

    solexa_quality_from_phred(phred_quality)
        Covert a PHRED quality (range 0 to about 90) to a Solexa quality.

        PHRED and Solexa quality scores are both log transformations of a
        probality of error (high score = low probability of error). This function
        takes a PHRED score, transforms it back to a probability of error, and
        then re-expresses it as a Solexa score. This assumes the error estimates
        are equivalent.

        How does this work exactly? Well the PHRED quality is minus ten times the
        base ten logarithm of the probability of error::

            phred_quality = -10*log(error,10)

        Therefore, turning this round::

            error = 10 ** (- phred_quality / 10)

        Now, Solexa qualities use a different log transformation::

            solexa_quality = -10*log(error/(1-error),10)

        After substitution and a little manipulation we get::

             solexa_quality = 10*log(10**(phred_quality/10.0) - 1, 10)

        However, real Solexa files use a minimum quality of -5. This does have a
        good reason - a random base call would be correct 25% of the time,
        and thus have a probability of error of 0.75, which gives 1.25 as the PHRED
        quality, or -4.77 as the Solexa quality. Thus (after rounding), a random
        nucleotide read would have a PHRED quality of 1, or a Solexa quality of -5.

        Taken literally, this logarithic formula would map a PHRED quality of zero
        to a Solexa quality of minus infinity. Of course, taken literally, a PHRED
        score of zero means a probability of error of one (i.e. the base call is
        definitely wrong), which is worse than random! In practice, a PHRED quality
        of zero usually means a default value, or perhaps random - and therefore
        mapping it to the minimum Solexa score of -5 is reasonable.

        In conclusion, we follow EMBOSS, and take this logarithmic formula but also
        apply a minimum value of -5.0 for the Solexa quality, and also map a PHRED
        quality of zero to -5.0 as well.

        Note this function will return a floating point number, it is up to you to
        round this to the nearest integer if appropriate.  e.g.

        >>> print("%0.2f" % round(solexa_quality_from_phred(80), 2))
        80.00
        >>> print("%0.2f" % round(solexa_quality_from_phred(50), 2))
        50.00
        >>> print("%0.2f" % round(solexa_quality_from_phred(20), 2))
        19.96
        >>> print("%0.2f" % round(solexa_quality_from_phred(10), 2))
        9.54
        >>> print("%0.2f" % round(solexa_quality_from_phred(5), 2))
        3.35
        >>> print("%0.2f" % round(solexa_quality_from_phred(4), 2))
        1.80
        >>> print("%0.2f" % round(solexa_quality_from_phred(3), 2))
        -0.02
        >>> print("%0.2f" % round(solexa_quality_from_phred(2), 2))
        -2.33
        >>> print("%0.2f" % round(solexa_quality_from_phred(1), 2))
        -5.00
        >>> print("%0.2f" % round(solexa_quality_from_phred(0), 2))
        -5.00

        Notice that for high quality reads PHRED and Solexa scores are numerically
        equal. The differences are important for poor quality reads, where PHRED
        has a minimum of zero but Solexa scores can be negative.

        Finally, as a special case where None is used for a "missing value", None
        is returned:

        >>> print(solexa_quality_from_phred(None))
        None

DATA
    SANGER_SCORE_OFFSET = 33
    SOLEXA_SCORE_OFFSET = 64
    __docformat__ = 'restructuredtext en'
    print_function = _Feature((2, 6, 0, 'alpha', 2), (3, 0, 0, 'alpha', 0)...
    single_letter_alphabet = SingleLetterAlphabet()

FILE
    /home/tiago_antao/miniconda/lib/python3.5/site-packages/Bio/SeqIO/QualityIO.py


Converting FASTA and QUAL files into FASTQ files

FASTQ files hold both sequences and their quality strings. FASTA files hold just sequences, while QUAL files hold just the qualities. Therefore a single FASTQ file can be converted to or from paired FASTA and QUAL files.

Going from FASTQ to FASTA is easy:

from Bio import SeqIO
SeqIO.convert("example.fastq", "fastq", "example.fasta", "fasta")

Going from FASTQ to QUAL is also easy:

from Bio import SeqIO
SeqIO.convert("example.fastq", "fastq", "example.qual", "qual")

However, the reverse is a little more tricky. You can use Bio.SeqIO.parse() to iterate over the records in a single file, but in this case we have two input files. There are several strategies possible, but assuming that the two files are really paired the most memory efficient way is to loop over both together. The code is a little fiddly, so we provide a function called PairedFastaQualIterator in the Bio.SeqIO.QualityIO module to do this. This takes two handles (the FASTA file and the QUAL file) and returns a SeqRecord iterator:

from Bio.SeqIO.QualityIO import PairedFastaQualIterator
for record in PairedFastaQualIterator(open("example.fasta"), open("example.qual")):
   print(record)

This function will check that the FASTA and QUAL files are consistent (e.g. the records are in the same order, and have the same sequence length). You can combine this with the Bio.SeqIO.write() function to convert a pair of FASTA and QUAL files into a single FASTQ files:

from Bio import SeqIO
from Bio.SeqIO.QualityIO import PairedFastaQualIterator
handle = open("temp.fastq", "w") #w=write
records = PairedFastaQualIterator(open("example.fasta"), open("example.qual"))
count = SeqIO.write(records, handle, "fastq")
handle.close()
print("Converted %i records" % count)

Indexing a FASTQ file

FASTQ files are often very large, with millions of reads in them. Due to the sheer amount of data, you can’t load all the records into memory at once. This is why the examples above (filtering and trimming) iterate over the file looking at just one SeqRecord at a time.

However, sometimes you can’t use a big loop or an iterator - you may need random access to the reads. Here the Bio.SeqIO.index() function may prove very helpful, as it allows you to access any read in the FASTQ file by its name (see Section [sec:SeqIO-index]).

Again we’ll use the SRR020192.fastq file from the ENA (ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR020/SRR020192/SRR020192.fastq.gz), although this is actually quite a small FASTQ file with less than \(50,000\) reads:

In [9]:
!wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR020/SRR020192/SRR020192.fastq.gz
!gzip -d SRR020192.fastq.gz
In [10]:
from Bio import SeqIO
fq_dict = SeqIO.index("SRR020192.fastq", "fastq")
len(fq_dict)
Out[10]:
41892
In [12]:
fq_dict["SRR020192.23186"].seq
Out[12]:
Seq('GTCCCAGTATTCGGATTTGTCTGCCAAAACAATGAAATTGACACAGTTTACAAC...CCG', SingleLetterAlphabet())

When testing this on a FASTQ file with seven million reads, indexing took about a minute, but record access was almost instant.

The example in Section [sec:SeqIO-sort] show how you can use the Bio.SeqIO.index() function to sort a large FASTA file – this could also be used on FASTQ files.

Converting SFF files

If you work with 454 (Roche) sequence data, you will probably have access to the raw data as a Standard Flowgram Format (SFF) file. This contains the sequence reads (called bases) with quality scores and the original flow information.

A common task is to convert from SFF to a pair of FASTA and QUAL files, or to a single FASTQ file. These operations are trivial using the Bio.SeqIO.convert() function (see Section [sec:SeqIO-conversion]):

In [14]:
from Bio import SeqIO
SeqIO.convert("data/E3MFGYR02_random_10_reads.sff", "sff", "reads.fasta", "fasta")
Out[14]:
10
In [15]:
SeqIO.convert("data/E3MFGYR02_random_10_reads.sff", "sff", "reads.qual", "qual")
Out[15]:
10
In [16]:
SeqIO.convert("data/E3MFGYR02_random_10_reads.sff", "sff", "reads.fastq", "fastq")
Out[16]:
10

Remember the convert function returns the number of records, in this example just ten. This will give you the untrimmed reads, where the leading and trailing poor quality sequence or adaptor will be in lower case. If you want the trimmed reads (using the clipping information recorded within the SFF file) use this:

In [17]:
from Bio import SeqIO
SeqIO.convert("data/E3MFGYR02_random_10_reads.sff", "sff-trim", "trimmed.fasta", "fasta")
Out[17]:
10
In [18]:
SeqIO.convert("data/E3MFGYR02_random_10_reads.sff", "sff-trim", "trimmed.qual", "qual")
Out[18]:
10
In [19]:
SeqIO.convert("data/E3MFGYR02_random_10_reads.sff", "sff-trim", "trimmed.fastq", "fastq")
Out[19]:
10

If you run Linux, you could ask Roche for a copy of their “off instrument” tools (often referred to as the Newbler tools). This offers an alternative way to do SFF to FASTA or QUAL conversion at the command line (but currently FASTQ output is not supported), e.g.

$ sffinfo -seq -notrim E3MFGYR02_random_10_reads.sff > reads.fasta
$ sffinfo -qual -notrim E3MFGYR02_random_10_reads.sff > reads.qual
$ sffinfo -seq -trim E3MFGYR02_random_10_reads.sff > trimmed.fasta
$ sffinfo -qual -trim E3MFGYR02_random_10_reads.sff > trimmed.qual

The way Biopython uses mixed case sequence strings to represent the trimming points deliberately mimics what the Roche tools do.

For more information on the Biopython SFF support, consult the built in help:

In [20]:
from Bio.SeqIO import SffIO
help(SffIO)
Help on module Bio.SeqIO.SffIO in Bio.SeqIO:

NAME
    Bio.SeqIO.SffIO - Bio.SeqIO support for the binary Standard Flowgram Format (SFF) file format.

DESCRIPTION
    SFF was designed by 454 Life Sciences (Roche), the Whitehead Institute for
    Biomedical Research and the Wellcome Trust Sanger Institute. SFF was also used
    as the native output format from early versions of Ion Torrent's PGM platform
    as well. You are expected to use this module via the Bio.SeqIO functions under
    the format name "sff" (or "sff-trim" as described below).

    For example, to iterate over the records in an SFF file,

        >>> from Bio import SeqIO
        >>> for record in SeqIO.parse("Roche/E3MFGYR02_random_10_reads.sff", "sff"):
        ...     print("%s %i %s..." % (record.id, len(record), record.seq[:20]))
        ...
        E3MFGYR02JWQ7T 265 tcagGGTCTACATGTTGGTT...
        E3MFGYR02JA6IL 271 tcagTTTTTTTTGGAAAGGA...
        E3MFGYR02JHD4H 310 tcagAAAGACAAGTGGTATC...
        E3MFGYR02GFKUC 299 tcagCGGCCGGGCCTCTCAT...
        E3MFGYR02FTGED 281 tcagTGGTAATGGGGGGAAA...
        E3MFGYR02FR9G7 261 tcagCTCCGTAAGAAGGTGC...
        E3MFGYR02GAZMS 278 tcagAAAGAAGTAAGGTAAA...
        E3MFGYR02HHZ8O 221 tcagACTTTCTTCTTTACCG...
        E3MFGYR02GPGB1 269 tcagAAGCAGTGGTATCAAC...
        E3MFGYR02F7Z7G 219 tcagAATCATCCACTTTTTA...

    Each SeqRecord object will contain all the annotation from the SFF file,
    including the PHRED quality scores.

        >>> print("%s %i" % (record.id, len(record)))
        E3MFGYR02F7Z7G 219
        >>> print("%s..." % record.seq[:10])
        tcagAATCAT...
        >>> print("%r..." % (record.letter_annotations["phred_quality"][:10]))
        [22, 21, 23, 28, 26, 15, 12, 21, 28, 21]...

    Notice that the sequence is given in mixed case, the central upper case region
    corresponds to the trimmed sequence. This matches the output of the Roche
    tools (and the 3rd party tool sff_extract) for SFF to FASTA.

        >>> print(record.annotations["clip_qual_left"])
        4
        >>> print(record.annotations["clip_qual_right"])
        134
        >>> print(record.seq[:4])
        tcag
        >>> print("%s...%s" % (record.seq[4:20], record.seq[120:134]))
        AATCATCCACTTTTTA...CAAAACACAAACAG
        >>> print(record.seq[134:])
        atcttatcaacaaaactcaaagttcctaactgagacacgcaacaggggataagacaaggcacacaggggataggnnnnnnnnnnn

    The annotations dictionary also contains any adapter clip positions
    (usually zero), and information about the flows. e.g.

        >>> len(record.annotations)
        11
        >>> print(record.annotations["flow_key"])
        TCAG
        >>> print(record.annotations["flow_values"][:10])
        (83, 1, 128, 7, 4, 84, 6, 106, 3, 172)
        >>> print(len(record.annotations["flow_values"]))
        400
        >>> print(record.annotations["flow_index"][:10])
        (1, 2, 3, 2, 2, 0, 3, 2, 3, 3)
        >>> print(len(record.annotations["flow_index"]))
        219

    Note that to convert from a raw reading in flow_values to the corresponding
    homopolymer stretch estimate, the value should be rounded to the nearest 100:

        >>> print("%r..." % [int(round(value, -2)) // 100
        ...                  for value in record.annotations["flow_values"][:10]])
        ...
        [1, 0, 1, 0, 0, 1, 0, 1, 0, 2]...

    If a read name is exactly 14 alphanumeric characters, the annotations
    dictionary will also contain meta-data about the read extracted by
    interpretting the name as a 454 Sequencing System "Universal" Accession
    Number. Note that if a read name happens to be exactly 14 alphanumeric
    characters but was not generated automatically, these annotation records
    will contain nonsense information.

        >>> print(record.annotations["region"])
        2
        >>> print(record.annotations["time"])
        [2008, 1, 9, 16, 16, 0]
        >>> print(record.annotations["coords"])
        (2434, 1658)

    As a convenience method, you can read the file with SeqIO format name "sff-trim"
    instead of "sff" to get just the trimmed sequences (without any annotation
    except for the PHRED quality scores and anything encoded in the read names):

        >>> from Bio import SeqIO
        >>> for record in SeqIO.parse("Roche/E3MFGYR02_random_10_reads.sff", "sff-trim"):
        ...     print("%s %i %s..." % (record.id, len(record), record.seq[:20]))
        ...
        E3MFGYR02JWQ7T 260 GGTCTACATGTTGGTTAACC...
        E3MFGYR02JA6IL 265 TTTTTTTTGGAAAGGAAAAC...
        E3MFGYR02JHD4H 292 AAAGACAAGTGGTATCAACG...
        E3MFGYR02GFKUC 295 CGGCCGGGCCTCTCATCGGT...
        E3MFGYR02FTGED 277 TGGTAATGGGGGGAAATTTA...
        E3MFGYR02FR9G7 256 CTCCGTAAGAAGGTGCTGCC...
        E3MFGYR02GAZMS 271 AAAGAAGTAAGGTAAATAAC...
        E3MFGYR02HHZ8O 150 ACTTTCTTCTTTACCGTAAC...
        E3MFGYR02GPGB1 221 AAGCAGTGGTATCAACGCAG...
        E3MFGYR02F7Z7G 130 AATCATCCACTTTTTAACGT...

    Looking at the final record in more detail, note how this differs to the
    example above:

        >>> print("%s %i" % (record.id, len(record)))
        E3MFGYR02F7Z7G 130
        >>> print("%s..." % record.seq[:10])
        AATCATCCAC...
        >>> print("%r..." % record.letter_annotations["phred_quality"][:10])
        [26, 15, 12, 21, 28, 21, 36, 28, 27, 27]...
        >>> len(record.annotations)
        3
        >>> print(record.annotations["region"])
        2
        >>> print(record.annotations["coords"])
        (2434, 1658)
        >>> print(record.annotations["time"])
        [2008, 1, 9, 16, 16, 0]

    You might use the Bio.SeqIO.convert() function to convert the (trimmed) SFF
    reads into a FASTQ file (or a FASTA file and a QUAL file), e.g.

        >>> from Bio import SeqIO
        >>> try:
        ...     from StringIO import StringIO # Python 2
        ... except ImportError:
        ...     from io import StringIO # Python 3
        ...
        >>> out_handle = StringIO()
        >>> count = SeqIO.convert("Roche/E3MFGYR02_random_10_reads.sff", "sff",
        ...                       out_handle, "fastq")
        ...
        >>> print("Converted %i records" % count)
        Converted 10 records

    The output FASTQ file would start like this:

        >>> print("%s..." % out_handle.getvalue()[:50])
        @E3MFGYR02JWQ7T
        tcagGGTCTACATGTTGGTTAACCCGTACTGATT...

    Bio.SeqIO.index() provides memory efficient random access to the reads in an
    SFF file by name. SFF files can include an index within the file, which can
    be read in making this very fast. If the index is missing (or in a format not
    yet supported in Biopython) the file is indexed by scanning all the reads -
    which is a little slower. For example,

        >>> from Bio import SeqIO
        >>> reads = SeqIO.index("Roche/E3MFGYR02_random_10_reads.sff", "sff")
        >>> record = reads["E3MFGYR02JHD4H"]
        >>> print("%s %i %s..." % (record.id, len(record), record.seq[:20]))
        E3MFGYR02JHD4H 310 tcagAAAGACAAGTGGTATC...
        >>> reads.close()

    Or, using the trimmed reads:

        >>> from Bio import SeqIO
        >>> reads = SeqIO.index("Roche/E3MFGYR02_random_10_reads.sff", "sff-trim")
        >>> record = reads["E3MFGYR02JHD4H"]
        >>> print("%s %i %s..." % (record.id, len(record), record.seq[:20]))
        E3MFGYR02JHD4H 292 AAAGACAAGTGGTATCAACG...
        >>> reads.close()

    You can also use the Bio.SeqIO.write() function with the "sff" format. Note
    that this requires all the flow information etc, and thus is probably only
    useful for SeqRecord objects originally from reading another SFF file (and
    not the trimmed SeqRecord objects from parsing an SFF file as "sff-trim").

    As an example, let's pretend this example SFF file represents some DNA which
    was pre-amplified with a PCR primers AAAGANNNNN. The following script would
    produce a sub-file containing all those reads whose post-quality clipping
    region (i.e. the sequence after trimming) starts with AAAGA exactly (the non-
    degenerate bit of this pretend primer):

        >>> from Bio import SeqIO
        >>> records = (record for record in
        ...            SeqIO.parse("Roche/E3MFGYR02_random_10_reads.sff", "sff")
        ...            if record.seq[record.annotations["clip_qual_left"]:].startswith("AAAGA"))
        ...
        >>> count = SeqIO.write(records, "temp_filtered.sff", "sff")
        >>> print("Selected %i records" % count)
        Selected 2 records

    Of course, for an assembly you would probably want to remove these primers.
    If you want FASTA or FASTQ output, you could just slice the SeqRecord. However,
    if you want SFF output we have to preserve all the flow information - the trick
    is just to adjust the left clip position!

        >>> from Bio import SeqIO
        >>> def filter_and_trim(records, primer):
        ...     for record in records:
        ...         if record.seq[record.annotations["clip_qual_left"]:].startswith(primer):
        ...             record.annotations["clip_qual_left"] += len(primer)
        ...             yield record
        ...
        >>> records = SeqIO.parse("Roche/E3MFGYR02_random_10_reads.sff", "sff")
        >>> count = SeqIO.write(filter_and_trim(records, "AAAGA"),
        ...                     "temp_filtered.sff", "sff")
        ...
        >>> print("Selected %i records" % count)
        Selected 2 records

    We can check the results, note the lower case clipped region now includes the "AAAGA"
    sequence:

        >>> for record in SeqIO.parse("temp_filtered.sff", "sff"):
        ...     print("%s %i %s..." % (record.id, len(record), record.seq[:20]))
        ...
        E3MFGYR02JHD4H 310 tcagaaagaCAAGTGGTATC...
        E3MFGYR02GAZMS 278 tcagaaagaAGTAAGGTAAA...
        >>> for record in SeqIO.parse("temp_filtered.sff", "sff-trim"):
        ...     print("%s %i %s..." % (record.id, len(record), record.seq[:20]))
        ...
        E3MFGYR02JHD4H 287 CAAGTGGTATCAACGCAGAG...
        E3MFGYR02GAZMS 266 AGTAAGGTAAATAACAAACG...
        >>> import os
        >>> os.remove("temp_filtered.sff")

    For a description of the file format, please see the Roche manuals and:
    http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=formats

CLASSES
    Bio.SeqIO.Interfaces.SequenceWriter(builtins.object)
        SffWriter

    class SffWriter(Bio.SeqIO.Interfaces.SequenceWriter)
     |  SFF file writer.
     |
     |  Method resolution order:
     |      SffWriter
     |      Bio.SeqIO.Interfaces.SequenceWriter
     |      builtins.object
     |
     |  Methods defined here:
     |
     |  __init__(self, handle, index=True, xml=None)
     |      Creates the writer object.
     |
     |      - handle - Output handle, ideally in binary write mode.
     |      - index - Boolean argument, should we try and write an index?
     |      - xml - Optional string argument, xml manifest to be recorded in the index
     |        block (see function ReadRocheXmlManifest for reading this data).
     |
     |  write_file(self, records)
     |      Use this to write an entire file containing the given records.
     |
     |  write_header(self)
     |
     |  write_record(self, record)
     |      Write a single additional record to the output file.
     |
     |      This assumes the header has been done.
     |
     |  ----------------------------------------------------------------------
     |  Methods inherited from Bio.SeqIO.Interfaces.SequenceWriter:
     |
     |  clean(self, text)
     |      Use this to avoid getting newlines in the output.
     |
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from Bio.SeqIO.Interfaces.SequenceWriter:
     |
     |  __dict__
     |      dictionary for instance variables (if defined)
     |
     |  __weakref__
     |      list of weak references to the object (if defined)

FUNCTIONS
    ReadRocheXmlManifest(handle)
        Reads any Roche style XML manifest data in the SFF "index".

        The SFF file format allows for multiple different index blocks, and Roche
        took advantage of this to define their own index block which also embeds
        an XML manifest string. This is not a publically documented extension to
        the SFF file format, this was reverse engineered.

        The handle should be to an SFF file opened in binary mode. This function
        will use the handle seek/tell functions and leave the handle in an
        arbitrary location.

        Any XML manifest found is returned as a Python string, which you can then
        parse as appropriate, or reuse when writing out SFF files with the
        SffWriter class.

        Returns a string, or raises a ValueError if an Roche manifest could not be
        found.

    SffIterator(handle, alphabet=DNAAlphabet(), trim=False)
        Iterate over Standard Flowgram Format (SFF) reads (as SeqRecord objects).

            - handle - input file, an SFF file, e.g. from Roche 454 sequencing.
              This must NOT be opened in universal read lines mode!
            - alphabet - optional alphabet, defaults to generic DNA.
            - trim - should the sequences be trimmed?

        The resulting SeqRecord objects should match those from a paired FASTA
        and QUAL file converted from the SFF file using the Roche 454 tool
        ssfinfo. i.e. The sequence will be mixed case, with the trim regions
        shown in lower case.

        This function is used internally via the Bio.SeqIO functions:

        >>> from Bio import SeqIO
        >>> for record in SeqIO.parse("Roche/E3MFGYR02_random_10_reads.sff", "sff"):
        ...     print("%s %i" % (record.id, len(record)))
        ...
        E3MFGYR02JWQ7T 265
        E3MFGYR02JA6IL 271
        E3MFGYR02JHD4H 310
        E3MFGYR02GFKUC 299
        E3MFGYR02FTGED 281
        E3MFGYR02FR9G7 261
        E3MFGYR02GAZMS 278
        E3MFGYR02HHZ8O 221
        E3MFGYR02GPGB1 269
        E3MFGYR02F7Z7G 219

        You can also call it directly:

        >>> with open("Roche/E3MFGYR02_random_10_reads.sff", "rb") as handle:
        ...     for record in SffIterator(handle):
        ...         print("%s %i" % (record.id, len(record)))
        ...
        E3MFGYR02JWQ7T 265
        E3MFGYR02JA6IL 271
        E3MFGYR02JHD4H 310
        E3MFGYR02GFKUC 299
        E3MFGYR02FTGED 281
        E3MFGYR02FR9G7 261
        E3MFGYR02GAZMS 278
        E3MFGYR02HHZ8O 221
        E3MFGYR02GPGB1 269
        E3MFGYR02F7Z7G 219

        Or, with the trim option:

        >>> with open("Roche/E3MFGYR02_random_10_reads.sff", "rb") as handle:
        ...     for record in SffIterator(handle, trim=True):
        ...         print("%s %i" % (record.id, len(record)))
        ...
        E3MFGYR02JWQ7T 260
        E3MFGYR02JA6IL 265
        E3MFGYR02JHD4H 292
        E3MFGYR02GFKUC 295
        E3MFGYR02FTGED 277
        E3MFGYR02FR9G7 256
        E3MFGYR02GAZMS 271
        E3MFGYR02HHZ8O 150
        E3MFGYR02GPGB1 221
        E3MFGYR02F7Z7G 130

DATA
    __docformat__ = 'restructuredtext en'
    print_function = _Feature((2, 6, 0, 'alpha', 2), (3, 0, 0, 'alpha', 0)...

FILE
    /home/tiago_antao/miniconda/lib/python3.5/site-packages/Bio/SeqIO/SffIO.py


Identifying open reading frames

A very simplistic first step at identifying possible genes is to look for open reading frames (ORFs). By this we mean look in all six frames for long regions without stop codons – an ORF is just a region of nucleotides with no in frame stop codons.

Of course, to find a gene you would also need to worry about locating a start codon, possible promoters – and in Eukaryotes there are introns to worry about too. However, this approach is still useful in viruses and Prokaryotes.

To show how you might approach this with Biopython, we’ll need a sequence to search, and as an example we’ll again use the bacterial plasmid – although this time we’ll start with a plain FASTA file with no pre-marked genes: `NC_005816.fna <http://biopython.org/SRC/biopython/Tests/GenBank/NC_005816.fna>`__. This is a bacterial sequence, so we’ll want to use NCBI codon table 11 (see Section [sec:translation] about translation).

In [22]:
from Bio import SeqIO
record = SeqIO.read("data/NC_005816.fna", "fasta")
table = 11
min_pro_len = 100

Here is a neat trick using the Seq object’s split method to get a list of all the possible ORF translations in the six reading frames:

In [24]:
for strand, nuc in [(+1, record.seq), (-1, record.seq.reverse_complement())]:
    for frame in range(3):
        length = 3 * ((len(record)-frame) // 3) #Multiple of three
        for pro in nuc[frame:frame+length].translate(table).split("*"):
            if len(pro) >= min_pro_len:
                print("%s...%s - length %i, strand %i, frame %i"
                      % (pro[:30], pro[-3:], len(pro), strand, frame))
GCLMKKSSIVATIITILSGSANAASSQLIP...YRF - length 315, strand 1, frame 0
KSGELRQTPPASSTLHLRLILQRSGVMMEL...NPE - length 285, strand 1, frame 1
GLNCSFFSICNWKFIDYINRLFQIIYLCKN...YYH - length 176, strand 1, frame 1
VKKILYIKALFLCTVIKLRRFIFSVNNMKF...DLP - length 165, strand 1, frame 1
NQIQGVICSPDSGEFMVTFETVMEIKILHK...GVA - length 355, strand 1, frame 2
RRKEHVSKKRRPQKRPRRRRFFHRLRPPDE...PTR - length 128, strand 1, frame 2
TGKQNSCQMSAIWQLRQNTATKTRQNRARI...AIK - length 100, strand 1, frame 2
QGSGYAFPHASILSGIAMSHFYFLVLHAVK...CSD - length 114, strand -1, frame 0
IYSTSEHTGEQVMRTLDEVIASRSPESQTR...FHV - length 111, strand -1, frame 0
WGKLQVIGLSMWMVLFSQRFDDWLNEQEDA...ESK - length 125, strand -1, frame 1
RGIFMSDTMVVNGSGGVPAFLFSGSTLSSY...LLK - length 361, strand -1, frame 1
WDVKTVTGVLHHPFHLTFSLCPEGATQSGR...VKR - length 111, strand -1, frame 1
LSHTVTDFTDQMAQVGLCQCVNVFLDEVTG...KAA - length 107, strand -1, frame 2
RALTGLSAPGIRSQTSCDRLRELRYVPVSL...PLQ - length 119, strand -1, frame 2

Note that here we are counting the frames from the 5’ end (start) of each strand. It is sometimes easier to always count from the 5’ end (start) of the forward strand.

You could easily edit the above loop based code to build up a list of the candidate proteins, or convert this to a list comprehension. Now, one thing this code doesn’t do is keep track of where the proteins are.

You could tackle this in several ways. For example, the following code tracks the locations in terms of the protein counting, and converts back to the parent sequence by multiplying by three, then adjusting for the frame and strand:

from Bio import SeqIO
record = SeqIO.read("NC_005816.gb","genbank")
table = 11
min_pro_len = 100

def find_orfs_with_trans(seq, trans_table, min_protein_length):
    answer = []
    seq_len = len(seq)
    for strand, nuc in [(+1, seq), (-1, seq.reverse_complement())]:
        for frame in range(3):
            trans = str(nuc[frame:].translate(trans_table))
            trans_len = len(trans)
            aa_start = 0
            aa_end = 0
            while aa_start < trans_len:
                aa_end = trans.find("*", aa_start)
                if aa_end == -1:
                    aa_end = trans_len
                if aa_end-aa_start >= min_protein_length:
                    if strand == 1:
                        start = frame+aa_start*3
                        end = min(seq_len,frame+aa_end*3+3)
                    else:
                        start = seq_len-frame-aa_end*3-3
                        end = seq_len-frame-aa_start*3
                    answer.append((start, end, strand,
                                   trans[aa_start:aa_end]))
                aa_start = aa_end+1
    answer.sort()
    return answer

orf_list = find_orfs_with_trans(record.seq, table, min_pro_len)
for start, end, strand, pro in orf_list:
    print("%s...%s - length %i, strand %i, %i:%i" \
          % (pro[:30], pro[-3:], len(pro), strand, start, end))

And the output:

NQIQGVICSPDSGEFMVTFETVMEIKILHK...GVA - length 355, strand 1, 41:1109
WDVKTVTGVLHHPFHLTFSLCPEGATQSGR...VKR - length 111, strand -1, 491:827
KSGELRQTPPASSTLHLRLILQRSGVMMEL...NPE - length 285, strand 1, 1030:1888
RALTGLSAPGIRSQTSCDRLRELRYVPVSL...PLQ - length 119, strand -1, 2830:3190
RRKEHVSKKRRPQKRPRRRRFFHRLRPPDE...PTR - length 128, strand 1, 3470:3857
GLNCSFFSICNWKFIDYINRLFQIIYLCKN...YYH - length 176, strand 1, 4249:4780
RGIFMSDTMVVNGSGGVPAFLFSGSTLSSY...LLK - length 361, strand -1, 4814:5900
VKKILYIKALFLCTVIKLRRFIFSVNNMKF...DLP - length 165, strand 1, 5923:6421
LSHTVTDFTDQMAQVGLCQCVNVFLDEVTG...KAA - length 107, strand -1, 5974:6298
GCLMKKSSIVATIITILSGSANAASSQLIP...YRF - length 315, strand 1, 6654:7602
IYSTSEHTGEQVMRTLDEVIASRSPESQTR...FHV - length 111, strand -1, 7788:8124
WGKLQVIGLSMWMVLFSQRFDDWLNEQEDA...ESK - length 125, strand -1, 8087:8465
TGKQNSCQMSAIWQLRQNTATKTRQNRARI...AIK - length 100, strand 1, 8741:9044
QGSGYAFPHASILSGIAMSHFYFLVLHAVK...CSD - length 114, strand -1, 9264:9609

If you comment out the sort statement, then the protein sequences will be shown in the same order as before, so you can check this is doing the same thing. Here we have sorted them by location to make it easier to compare to the actual annotation in the GenBank file (as visualised in Section [sec:gd_nice_example]).

If however all you want to find are the locations of the open reading frames, then it is a waste of time to translate every possible codon, including doing the reverse complement to search the reverse strand too. All you need to do is search for the possible stop codons (and their reverse complements). Using regular expressions is an obvious approach here (see the Python module re). These are an extremely powerful (but rather complex) way of describing search strings, which are supported in lots of programming languages and also command line tools like grep as well). You can find whole books about this topic!

Sequence parsing plus simple plots

This section shows some more examples of sequence parsing, using the Bio.SeqIO module described in Chapter [chapter:Bio.SeqIO], plus the Python library matplotlib’s pylab plotting interface (see the matplotlib website for a tutorial). Note that to follow these examples you will need matplotlib installed - but without it you can still try the data parsing bits.

Histogram of sequence lengths

There are lots of times when you might want to visualise the distribution of sequence lengths in a dataset – for example the range of contig sizes in a genome assembly project. In this example we’ll reuse our orchid FASTA file ls_orchid.fasta which has only 94 sequences.

First of all, we will use Bio.SeqIO to parse the FASTA file and compile a list of all the sequence lengths. You could do this with a for loop, but I find a list comprehension more pleasing:

In [28]:
from Bio import SeqIO
sizes = [len(rec) for rec in SeqIO.parse("data/ls_orchid.fasta", "fasta")]
len(sizes), min(sizes), max(sizes)
Out[28]:
(94, 572, 789)
In [29]:
sizes
Out[29]:
[740,
 753,
 748,
 744,
 733,
 718,
 730,
 704,
 740,
 709,
 700,
 726,
 753,
 699,
 658,
 752,
 726,
 765,
 755,
 742,
 762,
 745,
 750,
 731,
 741,
 740,
 727,
 711,
 743,
 727,
 757,
 770,
 767,
 759,
 750,
 788,
 774,
 789,
 688,
 719,
 743,
 737,
 728,
 740,
 696,
 732,
 731,
 735,
 720,
 740,
 629,
 572,
 587,
 700,
 636,
 716,
 592,
 716,
 733,
 626,
 737,
 740,
 574,
 594,
 610,
 730,
 641,
 702,
 733,
 738,
 736,
 732,
 745,
 744,
 738,
 739,
 740,
 745,
 695,
 745,
 743,
 730,
 706,
 744,
 742,
 694,
 712,
 715,
 688,
 784,
 721,
 703,
 744,
 592]

Now that we have the lengths of all the genes (as a list of integers), we can use the matplotlib histogram function to display it.

from Bio import SeqIO
sizes = [len(rec) for rec in SeqIO.parse("ls_orchid.fasta", "fasta")]

import pylab
pylab.hist(sizes, bins=20)
pylab.title("%i orchid sequences\nLengths %i to %i" \
            % (len(sizes),min(sizes),max(sizes)))
pylab.xlabel("Sequence length (bp)")
pylab.ylabel("Count")
pylab.show()

That should pop up a new window containing the following graph:

Histogram of orchid sequence lengths.

Fig. 2 Histogram of orchid sequence lengths.

That should pop up a new window containing the graph shown in Figure [fig:seq-len-hist].

Notice that most of these orchid sequences are about \(740\) bp long, and there could be two distinct classes of sequence here with a subset of shorter sequences.

Tip: Rather than using pylab.show() to show the plot in a window, you can also use pylab.savefig(...) to save the figure to a file (e.g. as a PNG or PDF).

Plot of sequence GC%

Another easily calculated quantity of a nucleotide sequence is the GC%. You might want to look at the GC% of all the genes in a bacterial genome for example, and investigate any outliers which could have been recently acquired by horizontal gene transfer. Again, for this example we’ll reuse our orchid FASTA file ls_orchid.fasta.

First of all, we will use Bio.SeqIO to parse the FASTA file and compile a list of all the GC percentages. Again, you could do this with a for loop, but I prefer this:

from Bio import SeqIO
from Bio.SeqUtils import GC

gc_values = sorted(GC(rec.seq) for rec in SeqIO.parse("ls_orchid.fasta", "fasta"))

Having read in each sequence and calculated the GC%, we then sorted them into ascending order. Now we’ll take this list of floating point values and plot them with matplotlib:

import pylab
pylab.plot(gc_values)
pylab.title("%i orchid sequences\nGC%% %0.1f to %0.1f" \
            % (len(gc_values),min(gc_values),max(gc_values)))
pylab.xlabel("Genes")
pylab.ylabel("GC%")
pylab.show()

As in the previous example, that should pop up a new window containing a graph:

Histogram of orchid sequence lengths.

Fig. 3 Histogram of orchid sequence lengths.

As in the previous example, that should pop up a new window with the graph shown in Figure [fig:seq-gc-plot].

If you tried this on the full set of genes from one organism, you’d probably get a much smoother plot than this.

Nucleotide dot plots

A dot plot is a way of visually comparing two nucleotide sequences for similarity to each other. A sliding window is used to compare short sub-sequences to each other, often with a mis-match threshold. Here for simplicity we’ll only look for perfect matches (shown in black

in Figure [fig:nuc-dot-plot]).

in the plot below).

To start off, we’ll need two sequences. For the sake of argument, we’ll just take the first two from our orchid FASTA file ls_orchid.fasta:

from Bio import SeqIO
handle = open("ls_orchid.fasta")
record_iterator = SeqIO.parse(handle, "fasta")
rec_one = next(record_iterator)
rec_two = next(record_iterator)
handle.close()

We’re going to show two approaches. Firstly, a simple naive implementation which compares all the window sized sub-sequences to each other to compiles a similarity matrix. You could construct a matrix or array object, but here we just use a list of lists of booleans created with a nested list comprehension:

window = 7
seq_one = str(rec_one.seq).upper()
seq_two = str(rec_two.seq).upper()
data = [[(seq_one[i:i+window] <> seq_two[j:j+window]) \
        for j in range(len(seq_one)-window)] \
       for i in range(len(seq_two)-window)]

Note that we have not checked for reverse complement matches here. Now we’ll use the matplotlib’s pylab.imshow() function to display this data, first requesting the gray color scheme so this is done in black and white:

import pylab
pylab.gray()
pylab.imshow(data)
pylab.xlabel("%s (length %i bp)" % (rec_one.id, len(rec_one)))
pylab.ylabel("%s (length %i bp)" % (rec_two.id, len(rec_two)))
pylab.title("Dot plot using window size %i\n(allowing no mis-matches)" % window)
pylab.show()

That should pop up a new window containing a graph like this:

That should pop up a new window showing the graph in Figure [fig:nuc-dot-plot].

As you might have expected, these two sequences are very similar with a partial line of window sized matches along the diagonal. There are no off diagonal matches which would be indicative of inversions or other interesting events.

The above code works fine on small examples, but there are two problems applying this to larger sequences, which we will address below. First off all, this brute force approach to the all against all comparisons is very slow. Instead, we’ll compile dictionaries mapping the window sized sub-sequences to their locations, and then take the set intersection to find those sub-sequences found in both sequences. This uses more memory, but is much faster. Secondly, the pylab.imshow() function is limited in the size of matrix it can display. As an alternative, we’ll use the pylab.scatter() function.

We start by creating dictionaries mapping the window-sized sub-sequences to locations:

window = 7
dict_one = {}
dict_two = {}
for (seq, section_dict) in [(str(rec_one.seq).upper(), dict_one),
                            (str(rec_two.seq).upper(), dict_two)]:
    for i in range(len(seq)-window):
        section = seq[i:i+window]
        try:
            section_dict[section].append(i)
        except KeyError:
            section_dict[section] = [i]
#Now find any sub-sequences found in both sequences
#(Python 2.3 would require slightly different code here)
matches = set(dict_one).intersection(dict_two)
print("%i unique matches" % len(matches))

In order to use the pylab.scatter() we need separate lists for the \(x\) and \(y\) co-ordinates:

#Create lists of x and y co-ordinates for scatter plot
x = []
y = []
for section in matches:
    for i in dict_one[section]:
        for j in dict_two[section]:
            x.append(i)
            y.append(j)

We are now ready to draw the revised dot plot as a scatter plot:

import pylab
pylab.cla() #clear any prior graph
pylab.gray()
pylab.scatter(x,y)
pylab.xlim(0, len(rec_one)-window)
pylab.ylim(0, len(rec_two)-window)
pylab.xlabel("%s (length %i bp)" % (rec_one.id, len(rec_one)))
pylab.ylabel("%s (length %i bp)" % (rec_two.id, len(rec_two)))
pylab.title("Dot plot using window size %i\n(allowing no mis-matches)" % window)
pylab.show()

That should pop up a new window containing a graph like this:

That should pop up a new window showing the graph in Figure [fig:nuc-dot-plot-scatter].

Personally I find this second plot much easier to read! Again note that we have not checked for reverse complement matches here – you could extend this example to do this, and perhaps plot the forward matches in one color and the reverse matches in another.

Plotting the quality scores of sequencing read data

If you are working with second generation sequencing data, you may want to try plotting the quality data. Here is an example using two FASTQ files containing paired end reads, SRR001666_1.fastq for the forward reads, and SRR001666_2.fastq for the reverse reads. These were downloaded from the ENA sequence read archive FTP site (ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001666/SRR001666_1.fastq.gz and ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001666/SRR001666_2.fastq.gz), and are from E. coli – see http://www.ebi.ac.uk/ena/data/view/SRR001666 for details. In the following code the pylab.subplot(...) function is used in order to show the forward and reverse qualities on two subplots, side by side. There is also a little bit of code to only plot the first fifty reads.

import pylab
from Bio import SeqIO
for subfigure in [1,2]:
    filename = "SRR001666_%i.fastq" % subfigure
    pylab.subplot(1, 2, subfigure)
    for i,record in enumerate(SeqIO.parse(filename, "fastq")):
        if i >= 50 : break #trick!
        pylab.plot(record.letter_annotations["phred_quality"])
    pylab.ylim(0,45)
    pylab.ylabel("PHRED quality score")
    pylab.xlabel("Position")
pylab.savefig("SRR001666.png")
print("Done")

You should note that we are using the Bio.SeqIO format name fastq here because the NCBI has saved these reads using the standard Sanger FASTQ format with PHRED scores. However, as you might guess from the read lengths, this data was from an Illumina Genome Analyzer and was probably originally in one of the two Solexa/Illumina FASTQ variant file formats instead.

This example uses the pylab.savefig(...) function instead of pylab.show(...), but as mentioned before both are useful.

Quality plot for some paired end reads.

Fig. 4 Quality plot for some paired end reads.

The result is shown in Figure [fig:paired-end-qual-plot].

Here is the result:

Dealing with alignments

This section can been seen as a follow on to Chapter [chapter:Bio.AlignIO].

Calculating summary information

Once you have an alignment, you are very likely going to want to find out information about it. Instead of trying to have all of the functions that can generate information about an alignment in the alignment object itself, we’ve tried to separate out the functionality into separate classes, which act on the alignment.

Getting ready to calculate summary information about an object is quick to do. Let’s say we’ve got an alignment object called alignment, for example read in using Bio.AlignIO.read(...) as described in Chapter [chapter:Bio.AlignIO]. All we need to do to get an object that will calculate summary information is:

from Bio.Align import AlignInfo
summary_align = AlignInfo.SummaryInfo(alignment)

The summary_align object is very useful, and will do the following neat things for you:

  1. Calculate a quick consensus sequence – see section [sec:consensus]
  2. Get a position specific score matrix for the alignment – see section [sec:pssm]
  3. Calculate the information content for the alignment – see section [sec:getting_info_content]
  4. Generate information on substitutions in the alignment – section [sec:sub_matrix] details using this to generate a substitution matrix.

Calculating a quick consensus sequence

The SummaryInfo object, described in section [sec:summary_info], provides functionality to calculate a quick consensus of an alignment. Assuming we’ve got a SummaryInfo object called summary_align we can calculate a consensus by doing:

consensus = summary_align.dumb_consensus()

As the name suggests, this is a really simple consensus calculator, and will just add up all of the residues at each point in the consensus, and if the most common value is higher than some threshold value will add the common residue to the consensus. If it doesn’t reach the threshold, it adds an ambiguity character to the consensus. The returned consensus object is Seq object whose alphabet is inferred from the alphabets of the sequences making up the consensus. So doing a print consensus would give:

consensus Seq('TATACATNAAAGNAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAAT
...', IUPACAmbiguousDNA())

You can adjust how dumb_consensus works by passing optional parameters:

the threshold
This is the threshold specifying how common a particular residue has to be at a position before it is added. The default is \(0.7\) (meaning \(70\%\)).
the ambiguous character
This is the ambiguity character to use. The default is ’N’.
the consensus alphabet
This is the alphabet to use for the consensus sequence. If an alphabet is not specified than we will try to guess the alphabet based on the alphabets of the sequences in the alignment.

Position Specific Score Matrices

Position specific score matrices (PSSMs) summarize the alignment information in a different way than a consensus, and may be useful for different tasks. Basically, a PSSM is a count matrix. For each column in the alignment, the number of each alphabet letters is counted and totaled. The totals are displayed relative to some representative sequence along the left axis. This sequence may be the consesus sequence, but can also be any sequence in the alignment. For instance for the alignment,

GTATC
AT--C
CTGTC

the PSSM is:

G A T C
    G 1 1 0 1
    T 0 0 3 0
    A 1 1 0 0
    T 0 0 2 0
    C 0 0 0 3

Let’s assume we’ve got an alignment object called c_align. To get a PSSM with the consensus sequence along the side we first get a summary object and calculate the consensus sequence:

summary_align = AlignInfo.SummaryInfo(c_align)
consensus = summary_align.dumb_consensus()

Now, we want to make the PSSM, but ignore any N ambiguity residues when calculating this:

my_pssm = summary_align.pos_specific_score_matrix(consensus,
                                                  chars_to_ignore = ['N'])

Two notes should be made about this:

  1. To maintain strictness with the alphabets, you can only include characters along the top of the PSSM that are in the alphabet of the alignment object. Gaps are not included along the top axis of the PSSM.
  2. The sequence passed to be displayed along the left side of the axis does not need to be the consensus. For instance, if you wanted to display the second sequence in the alignment along this axis, you would need to do:
second_seq = alignment.get_seq_by_num(1)
    my_pssm = summary_align.pos_specific_score_matrix(second_seq
                                                      chars_to_ignore = ['N'])

The command above returns a PSSM object. To print out the PSSM as shown above, we simply need to do a print(my_pssm), which gives:

A   C   G   T
T  0.0 0.0 0.0 7.0
A  7.0 0.0 0.0 0.0
T  0.0 0.0 0.0 7.0
A  7.0 0.0 0.0 0.0
C  0.0 7.0 0.0 0.0
A  7.0 0.0 0.0 0.0
T  0.0 0.0 0.0 7.0
T  1.0 0.0 0.0 6.0
...

You can access any element of the PSSM by subscripting like your_pssm[sequence_number][residue_count_name]. For instance, to get the counts for the ’A’ residue in the second element of the above PSSM you would do:

In [31]:
print(my_pssm[1]["A"])
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-31-9d60637622f7> in <module>()
----> 1 print(my_pssm[1]["A"])

NameError: name 'my_pssm' is not defined

The structure of the PSSM class hopefully makes it easy both to access elements and to pretty print the matrix.

Information Content

A potentially useful measure of evolutionary conservation is the information content of a sequence.

A useful introduction to information theory targeted towards molecular biologists can be found at http://www.lecb.ncifcrf.gov/~toms/paper/primer/. For our purposes, we will be looking at the information content of a consesus sequence, or a portion of a consensus sequence. We calculate information content at a particular column in a multiple sequence alignment using the following formula:

\[IC_{j} = \sum_{i=1}^{N_{a}} P_{ij} \mathrm{log}\left(\frac{P_{ij}}{Q_{i}}\right)\]

where:

  • \(IC_{j}\) – The information content for the \(j\)-th column in an alignment.
  • \(N_{a}\) – The number of letters in the alphabet.
  • \(P_{ij}\) – The frequency of a particular letter \(i\) in the \(j\)-th column (i. e. if G occurred 3 out of 6 times in an aligment column, this would be 0.5)
  • \(Q_{i}\) – The expected frequency of a letter \(i\). This is an optional argument, usage of which is left at the user’s discretion. By default, it is automatically assigned to \(0.05 = 1/20\) for a protein alphabet, and \(0.25 = 1/4\) for a nucleic acid alphabet. This is for geting the information content without any assumption of prior distributions. When assuming priors, or when using a non-standard alphabet, you should supply the values for \(Q_{i}\).

Well, now that we have an idea what information content is being calculated in Biopython, let’s look at how to get it for a particular region of the alignment.

First, we need to use our alignment to get an alignment summary object, which we’ll assume is called summary_align (see section [sec:summary_info]) for instructions on how to get this. Once we’ve got this object, calculating the information content for a region is as easy as:

info_content = summary_align.information_content(5, 30,
                                                 chars_to_ignore = ['N'])

Wow, that was much easier then the formula above made it look! The variable info_content now contains a float value specifying the information content over the specified region (from 5 to 30 of the alignment). We specifically ignore the ambiguity residue ’N’ when calculating the information content, since this value is not included in our alphabet (so we shouldn’t be interested in looking at it!).

As mentioned above, we can also calculate relative information content by supplying the expected frequencies:

expect_freq = {
    'A' : .3,
    'G' : .2,
    'T' : .3,
    'C' : .2}

The expected should not be passed as a raw dictionary, but instead by passed as a SubsMat.FreqTable object (see section [sec:freq_table] for more information about FreqTables). The FreqTable object provides a standard for associating the dictionary with an Alphabet, similar to how the Biopython Seq class works.

To create a FreqTable object, from the frequency dictionary you just need to do:

from Bio.Alphabet import IUPAC
from Bio.SubsMat import FreqTable

e_freq_table = FreqTable.FreqTable(expect_freq, FreqTable.FREQ,
                                   IUPAC.unambiguous_dna)

Now that we’ve got that, calculating the relative information content for our region of the alignment is as simple as:

info_content = summary_align.information_content(5, 30,
                                                 e_freq_table = e_freq_table,
                                                 chars_to_ignore = ['N'])

Now, info_content will contain the relative information content over the region in relation to the expected frequencies.

The value return is calculated using base 2 as the logarithm base in the formula above. You can modify this by passing the parameter log_base as the base you want:

info_content = summary_align.information_content(5, 30, log_base = 10,
                                                 chars_to_ignore = ['N'])

Well, now you are ready to calculate information content. If you want to try applying this to some real life problems, it would probably be best to dig into the literature on information content to get an idea of how it is used. Hopefully your digging won’t reveal any mistakes made in coding this function!

Substitution Matrices

Substitution matrices are an extremely important part of everyday bioinformatics work. They provide the scoring terms for classifying how likely two different residues are to substitute for each other. This is essential in doing sequence comparisons. The book “Biological Sequence Analysis” by Durbin et al. provides a really nice introduction to Substitution Matrices and their uses. Some famous substitution matrices are the PAM and BLOSUM series of matrices.

Biopython provides a ton of common substitution matrices, and also provides functionality for creating your own substitution matrices.

Using common substitution matrices

Creating your own substitution matrix from an alignment

A very cool thing that you can do easily with the substitution matrix classes is to create your own substitution matrix from an alignment. In practice, this is normally done with protein alignments. In this example, we’ll first get a Biopython alignment object and then get a summary object to calculate info about the alignment. The file containing protein.aln (also available online here) contains the Clustalw alignment output.

In [33]:
from Bio import AlignIO
from Bio import Alphabet
from Bio.Alphabet import IUPAC
from Bio.Align import AlignInfo
filename = "data/protein.aln"
alpha = Alphabet.Gapped(IUPAC.protein)
c_align = AlignIO.read(filename, "clustal", alphabet=alpha)
summary_align = AlignInfo.SummaryInfo(c_align)

Sections [sec:align_clustal] and [sec:summary_info] contain more information on doing this.

Now that we’ve got our summary_align object, we want to use it to find out the number of times different residues substitute for each other. To make the example more readable, we’ll focus on only amino acids with polar charged side chains. Luckily, this can be done easily when generating a replacement dictionary, by passing in all of the characters that should be ignored. Thus we’ll create a dictionary of replacements for only charged polar amino acids using:

In [34]:
replace_info = summary_align.replacement_dictionary(["G", "A", "V", "L", "I",
"M", "P", "F", "W", "S",
"T", "N", "Q", "Y", "C"])

This information about amino acid replacements is represented as a python dictionary which will look something like (the order can vary):

{('R', 'R'): 2079.0, ('R', 'H'): 17.0, ('R', 'K'): 103.0, ('R', 'E'): 2.0,
('R', 'D'): 2.0, ('H', 'R'): 0, ('D', 'H'): 15.0, ('K', 'K'): 3218.0,
('K', 'H'): 24.0, ('H', 'K'): 8.0, ('E', 'H'): 15.0, ('H', 'H'): 1235.0,
('H', 'E'): 18.0, ('H', 'D'): 0, ('K', 'D'): 0, ('K', 'E'): 9.0,
('D', 'R'): 48.0, ('E', 'R'): 2.0, ('D', 'K'): 1.0, ('E', 'K'): 45.0,
('K', 'R'): 130.0, ('E', 'D'): 241.0, ('E', 'E'): 3305.0,
('D', 'E'): 270.0, ('D', 'D'): 2360.0}

This information gives us our accepted number of replacements, or how often we expect different things to substitute for each other. It turns out, amazingly enough, that this is all of the information we need to go ahead and create a substitution matrix. First, we use the replacement dictionary information to create an Accepted Replacement Matrix (ARM):

In [35]:
from Bio import SubsMat
my_arm = SubsMat.SeqMat(replace_info)

With this accepted replacement matrix, we can go right ahead and create our log odds matrix (i. e. a standard type Substitution Matrix):

In [36]:
my_lom = SubsMat.make_log_odds_matrix(my_arm)

The log odds matrix you create is customizable with the following optional arguments:

  • exp_freq_table – You can pass a table of expected frequencies for each alphabet. If supplied, this will be used instead of the passed accepted replacement matrix when calculate expected replacments.
  • logbase - The base of the logarithm taken to create the log odd matrix. Defaults to base 10.
  • factor - The factor to multiply each matrix entry by. This defaults to 10, which normally makes the matrix numbers easy to work with.
  • round_digit - The digit to round to in the matrix. This defaults to 0 (i. e. no digits).

Once you’ve got your log odds matrix, you can display it prettily using the function print_mat. Doing this on our created matrix gives:

In [37]:
my_lom.print_mat()
D   2
E  -1   1
H  -5  -4   3
K -10  -5  -4   1
R  -4  -8  -4  -2   2
   D   E   H   K   R